
@tank/kubernetes-mastery

1.0.0

Description

Production Kubernetes operations and architecture. Covers workloads (Pods, Deployments, StatefulSets), networking (Services, Ingress), Helm/Kustomize, RBAC, Pod Security Standards, storage, autoscaling (HPA/VPA), health probes, kubectl debugging, observability (Prometheus/Grafana), and GitOps (ArgoCD/Flux).

Triggered by

kubernetes, k8s, kubectl, helm, deployment, ingress
tank install @tank/kubernetes-mastery

Kubernetes Mastery

Core Philosophy

  1. Declarative over imperative -- Define desired state in YAML manifests. Let controllers reconcile actual state. Never rely on kubectl run in production; commit manifests to Git.
  2. Least privilege by default -- Every workload gets its own ServiceAccount with minimal RBAC. Run as non-root. Drop all capabilities. Apply Pod Security Standards at namespace level.
  3. Resource-aware scheduling -- Always set CPU/memory requests (scheduler guarantee) and memory limits (OOM protection). Omit CPU limits for latency-sensitive workloads to avoid throttling.
  4. Probes are self-healing -- Configure startup probes for slow-init apps, readiness probes to gate traffic, and liveness probes to restart deadlocked processes. Aggressive liveness probes cause restart storms.
  5. GitOps is the deployment model -- ArgoCD or Flux syncs cluster state from Git. Manual kubectl apply is for emergencies only. Every change is auditable and reversible.
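
Principles 2 and 3 can be sketched as a single manifest. This is a minimal illustration, not a definitive baseline: the names `web` and `web-sa`, the `demo` namespace (assumed to exist already), the image reference, the UID, and the resource sizes are all hypothetical placeholders.

```yaml
# Hypothetical example: names, namespace, image, UID, and sizes are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-sa
  namespace: demo
# Do not mount an API token unless the app actually calls the Kubernetes API.
automountServiceAccountToken: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      serviceAccountName: web-sa        # dedicated SA instead of "default"
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001                # any non-zero UID the image supports
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: web
          image: registry.example.com/web:1.0.0
          ports:
            - containerPort: 8080
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:                   # what the scheduler reserves
              cpu: 250m
              memory: 256Mi
            limits:
              memory: 256Mi             # memory limit only; no CPU limit, so no CPU throttling
```

Setting the memory request equal to the memory limit keeps the container's memory footprint predictable for the scheduler; whether to omit the CPU limit depends on how latency-sensitive the workload is.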

Quick-Start: Common Problems

"My Pod is stuck in CrashLoopBackOff"

  1. Check exit code: kubectl describe pod <name> -- look at Last State and Exit Code
  2. Read logs: kubectl logs <name> --previous (shows last crashed container)
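  3. Exit code 137 = container received SIGKILL, most often OOM killed (confirm that Last State shows Reason: OOMKilled) -- increase the memory limit or reduce memory usage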
  4. Exit code 1 = application error -- fix the app
  5. Liveness probe failing? Check if probe path/port is correct and timeout is sufficient -> See references/observability-and-debugging.md
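
When the liveness probe is the culprit, the usual fix is to give the app time to start and to loosen the liveness timing. The Pod below is a sketch under assumptions: the `/healthz` and `/ready` paths, port 8080, the image, and every timing value are hypothetical and need to match what the application really exposes.

```yaml
# Hypothetical example: paths, port, image, and timings are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0.0
      ports:
        - containerPort: 8080
      resources:
        requests:
          memory: 256Mi
        limits:
          memory: 512Mi          # raise this if Last State shows OOMKilled
      startupProbe:              # liveness/readiness are held off until this succeeds
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 30     # allows up to 30 * 5s = 150s of startup time
      readinessProbe:            # failing only removes the Pod from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
      livenessProbe:             # failing restarts the container, so keep it lenient
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 3
        failureThreshold: 3
```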

"Which Service type should I use?"

| Scenario | Service Type |
| --- | --- |
| Pod-to-pod within cluster | ClusterIP (default) |
| External access via cloud LB | LoadBalancer |
| External access without cloud LB | NodePort + Ingress |
| Headless (direct pod DNS) | ClusterIP with clusterIP: None |
| External database/API | ExternalName or Endpoints |

-> See references/networking-and-services.md
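
Three of the rows above, written out as manifests. This is an illustrative sketch: the Service names, the `app: web` and `app: db` labels, the ports, and the external hostname are hypothetical.

```yaml
# Hypothetical example: names, labels, ports, and hostname are placeholders.
# 1) Default ClusterIP Service for in-cluster traffic.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: ClusterIP
  selector:
    app: web
  ports:
    - name: http
      port: 80            # port clients connect to
      targetPort: 8080    # port the container listens on
---
# 2) Headless Service: DNS returns the individual Pod IPs instead of one virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: db-headless
spec:
  clusterIP: None
  selector:
    app: db
  ports:
    - name: postgres
      port: 5432
---
# 3) ExternalName Service: a DNS CNAME to something outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: billing-api
spec:
  type: ExternalName
  externalName: billing.example.com
```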

"Helm or Kustomize?"

| Signal | Use |
| --- | --- |
| Packaging for distribution (charts) | Helm |
| Environment-specific overlays (dev/staging/prod) | Kustomize |
| Need templating with conditionals/loops | Helm |
| Prefer pure YAML, no templating language | Kustomize |
| Both -- Helm for third-party, Kustomize for in-house | Common hybrid |

-> See references/helm-and-kustomize.md
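
For the overlay case, a production overlay might look like the sketch below. The directory layout (`overlays/prod/` next to a `base/` directory), the `prod` namespace, and the Deployment name `web` are assumptions made for illustration.

```yaml
# Hypothetical file: overlays/prod/kustomization.yaml
# Assumes ../../base contains its own kustomization.yaml with a Deployment named "web".
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: prod            # applied to every resource in this overlay
labels:
  - pairs:
      environment: prod
patches:
  - target:
      kind: Deployment
      name: web
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
```

Rendering with `kubectl kustomize overlays/prod` prints the merged manifests without applying them, which is a convenient way to review an overlay before it reaches Git.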

"How do I set up autoscaling?"

  1. Set resource requests on all containers (HPA computes utilization as a percentage of the request, so it cannot scale on CPU or memory utilization without one)
  2. Deploy Metrics Server (kubectl apply -f metrics-server.yaml)
  3. Create HPA: kubectl autoscale deployment <name> --min=2 --max=10 --cpu-percent=70
  4. For custom metrics (queue depth, RPS): use Prometheus Adapter + HPA v2
  5. Add Cluster Autoscaler for node-level scaling -> See references/autoscaling-and-resources.md
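
The `kubectl autoscale` command in step 3 corresponds roughly to the declarative manifest below, which is the form that fits a Git-based workflow. The Deployment name `web` is a hypothetical placeholder.

```yaml
# Hypothetical example: the target Deployment name is a placeholder.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged across Pods
```

When an HPA manages a Deployment, leaving `spec.replicas` out of the Deployment manifest avoids the replica count being reset on every apply or sync.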

"My Deployment rollout is stuck"

  1. Check status: kubectl rollout status deployment/<name>
  2. Check events: kubectl describe deployment/<name> -- look for FailedCreate
  3. Insufficient resources? Scale down or add nodes
  4. Image pull error? Verify image name, tag, and imagePullSecrets
  5. Rollback: kubectl rollout undo deployment/<name> -> See references/gitops-and-deployment.md
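
Some of these failure modes can be made more visible in the Deployment spec itself. The sketch below is one possible configuration, not a recommendation for every workload: the names, image, pull secret `regcred`, and all numeric values are hypothetical.

```yaml
# Hypothetical example: names, image, secret, and numbers are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  progressDeadlineSeconds: 300   # rollout is reported as failed after 5 min without progress (default is 600)
  revisionHistoryLimit: 5        # old ReplicaSets kept for "kubectl rollout undo"
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # one extra Pod at a time, so spare capacity is needed
      maxUnavailable: 0          # never drop below the desired replica count
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      imagePullSecrets:
        - name: regcred          # must exist in the same namespace for private registries
      containers:
        - name: web
          image: registry.example.com/web:1.0.1
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```

With `maxUnavailable: 0`, a rollout cannot proceed on a cluster that has no room for the surge Pod, which is one common reason a rollout appears stuck.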

Decision Trees

Workload Controller Selection

| Workload Type | Controller |
| --- | --- |
| Stateless web app, API | Deployment |
| Database, distributed store | StatefulSet |
| Per-node agent (logging, monitoring) | DaemonSet |
| One-off batch processing | Job |
| Scheduled batch processing | CronJob |
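
As an illustration of the last row, a scheduled batch workload could be declared as follows. The name, schedule, and image are hypothetical.

```yaml
# Hypothetical example: name, schedule, and image are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"            # every day at 02:00 in the controller's time zone
  concurrencyPolicy: Forbid        # skip a run if the previous one is still going
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2              # retries before the Job is marked failed
      template:
        spec:
          restartPolicy: OnFailure # Jobs accept only OnFailure or Never
          containers:
            - name: report
              image: registry.example.com/report:1.0.0
              resources:
                requests:
                  cpu: 100m
                  memory: 128Mi
```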

Security Hardening Priority

| Priority | Action |
| --- | --- |
| 1 (Day 1) | Dedicated ServiceAccounts, no default SA |
| 2 (Day 1) | Pod Security Standards: warn first, then enforce restricted |
| 3 (Week 1) | Default-deny NetworkPolicies per namespace |
| 4 (Week 1) | RBAC audit -- remove wildcards and unnecessary ClusterRoleBindings |
| 5 (Ongoing) | Secrets in external store (Vault, ESO), not plain manifests |
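
Priorities 2 and 3 can be expressed as a namespace definition plus one policy. This is a sketch under assumptions: the namespace name `demo` is hypothetical, and the NetworkPolicy only takes effect if the cluster's network plugin enforces NetworkPolicies.

```yaml
# Hypothetical example: the namespace name is a placeholder.
apiVersion: v1
kind: Namespace
metadata:
  name: demo
  labels:
    # Phase 1: surface violations of "restricted" without blocking anything stricter than baseline.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/enforce: baseline
    # Phase 2 (once warnings are clean): change enforce to "restricted".
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: demo
spec:
  podSelector: {}        # empty selector = every Pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Denying egress also blocks DNS lookups, so a follow-up policy that allows traffic to the cluster DNS service is normally needed alongside this one.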

Storage Selection

| Need | Solution |
| --- | --- |
| Shared config files | ConfigMap (mounted as volume) |
| Credentials, API keys | Secret (+ External Secrets Operator) |
| Database storage | PVC with StorageClass (retain policy) |
| Shared filesystem (multi-pod) | ReadWriteMany PVC (NFS, EFS, CephFS) |
| Ephemeral scratch space | emptyDir |
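
The database-storage row could look like the sketch below. The provisioner is an assumption (the AWS EBS CSI driver is used purely as an example; substitute the CSI driver your cluster runs), and the names and size are hypothetical.

```yaml
# Hypothetical example: provisioner, names, and size are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-retain
provisioner: ebs.csi.aws.com              # assumption: replace with your cluster's CSI driver
reclaimPolicy: Retain                     # keep the underlying volume when the PVC is deleted
volumeBindingMode: WaitForFirstConsumer   # provision in the zone where the Pod is scheduled
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: db-retain
  resources:
    requests:
      storage: 20Gi
```

With `Retain`, a released volume is not cleaned up automatically, so deleting the data for good is a deliberate manual step.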

Reference Index

| File | Contents |
| --- | --- |
| references/workloads-and-controllers.md | Pods, Deployments, StatefulSets, DaemonSets, Jobs, CronJobs, ReplicaSets, init containers, sidecar pattern, pod lifecycle |
| references/networking-and-services.md | Service types (ClusterIP/NodePort/LoadBalancer/ExternalName), Ingress controllers, DNS, service discovery, service mesh overview |
| references/helm-and-kustomize.md | Helm chart anatomy, values/templates, chart repositories, hooks, Kustomize bases/overlays, patches, strategic merge, Helm vs Kustomize selection |
| references/security-and-rbac.md | RBAC (Roles/ClusterRoles/Bindings), ServiceAccounts, Pod Security Standards/Admission, NetworkPolicies, SecurityContext, OPA/Gatekeeper |
| references/storage-and-configuration.md | PersistentVolumes, PersistentVolumeClaims, StorageClasses, volume types, ConfigMaps, Secrets, External Secrets Operator, projected volumes |
| references/autoscaling-and-resources.md | Resource requests/limits, QoS classes, LimitRanges, ResourceQuotas, HPA (v1/v2), VPA, Cluster Autoscaler, Karpenter, right-sizing |
| references/observability-and-debugging.md | kubectl debug/logs/exec/describe, events, Prometheus, Grafana, log aggregation, troubleshooting CrashLoopBackOff/ImagePull/Pending/OOM |
| references/gitops-and-deployment.md | ArgoCD, Flux, rolling updates, blue/green, canary (Argo Rollouts/Flagger), PodDisruptionBudgets, rollback, progressive delivery |
