@tank/kueue
1.0.0 (Skill)
Description
Kubernetes-native queueing and quota management for batch, AI/ML, and HPC workloads with Kueue. Covers ClusterQueue, LocalQueue, ResourceFlavor, Workload, cohorts, admission checks, supported integrations, installation, kueuectl, metrics, preemption, fair sharing, MultiKueue, topology-aware scheduling, ProvisioningRequest autoscaling, tenancy patterns, and production recipes.
tank install @tank/kueue
Kueue
Core Philosophy
- Suspend first, admit later: Every supported Job is created in the `suspend: true` state by Kueue's mutating webhook. The Workload object is the queueing unit; the Job only unsuspends after the Workload is admitted. Never bypass this by setting `suspend: false` manually.
- Quota lives on the ClusterQueue, naming lives on the LocalQueue: Users submit Jobs with a `kueue.x-k8s.io/queue-name: <local-queue>` label. The LocalQueue points to a ClusterQueue, which holds the actual quota. This separates tenant identity from capacity policy.
- Cohorts are how slack capacity is shared: Two ClusterQueues in the same `cohort` can borrow each other's unused `nominalQuota` up to their `borrowingLimit`. Without a cohort, quota is hard-isolated. Set `lendingLimit` to cap how much a queue can be borrowed from.
- Flavors map workloads to hardware: A `ResourceFlavor` is a tuple of `(nodeLabels, taints, tolerations)`. Quota is per-flavor. The same `cpu` resource can have separate quotas for `spot` and `ondemand` flavors, and `flavorFungibility` controls fall-through behavior.
- AdmissionChecks gate the final unsuspend: Quota reservation is necessary but not sufficient. All AdmissionChecks (e.g., MultiKueue, ProvisioningRequest) must reach `Ready` before the Workload transitions to `Admitted` and the Job unsuspends.
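To make the cohort bullet concrete, here is a minimal sketch of one ClusterQueue that shares slack capacity with its cohort. It assumes the `kueue.x-k8s.io/v1beta1` API; the names `team-a-cq`, `research`, and `default-flavor` and all quantities are illustrative and do not come from this skill.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq              # hypothetical name
spec:
  cohort: research             # hypothetical cohort; queues with the same value can share slack
  namespaceSelector: {}        # empty selector: accept LocalQueues from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor     # must match an existing ResourceFlavor
      resources:
      - name: cpu
        nominalQuota: 40       # this queue's own share
        borrowingLimit: 20     # may use at most 20 extra CPUs of cohort slack
        lendingLimit: 10       # lends at most 10 of its unused CPUs to the cohort
      - name: memory
        nominalQuota: 160Gi
```

Depending on the Kueue version, `lendingLimit` may sit behind a feature gate, so check the release notes for the version you run.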
Quick-Start: Common Problems
"My Job is created but no Pods appear"
- Check the Job has the queue label: `kubectl get job <name> -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}'`
- Check the Workload exists: `kubectl get workload -l kueue.x-k8s.io/job-name=<job>`
- Read Workload conditions: `kubectl describe workload <wl>` and look at `QuotaReserved`, `Admitted`, and `Inadmissible` reasons
- If `Inadmissible`: quota is full or no flavor matches the Pod's nodeSelector
- If `QuotaReserved` but not `Admitted`: an AdmissionCheck is pending
-> See `references/operations-and-cli.md`
"Which Kueue CRD do I create first?"
| Order | CRD | Created by |
|---|---|---|
| 1 | ResourceFlavor | Cluster admin |
| 2 | ClusterQueue (references flavors, sets quota, optional cohort) | Cluster admin |
| 3 | LocalQueue in tenant namespace (points to ClusterQueue) | Namespace admin |
| 4 | WorkloadPriorityClass (optional) | Cluster admin |
| 5 | Job/RayJob/PyTorchJob with kueue.x-k8s.io/queue-name label | User |
-> See `references/concepts-and-architecture.md`
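The creation order in the table can be sketched as one multi-document manifest. This is a minimal example under assumptions: it uses the `kueue.x-k8s.io/v1beta1` API, skips the optional WorkloadPriorityClass, and every name (`default-flavor`, `batch-cq`, `team-a`, `team-a-queue`, `sample-job`), image, and quantity is hypothetical.

```yaml
# 1. ResourceFlavor (cluster admin). An empty spec matches any node.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# 2. ClusterQueue (cluster admin): references the flavor and holds the quota.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: batch-cq
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 8
      - name: memory
        nominalQuota: 32Gi
---
# 3. LocalQueue (namespace admin): the name tenants put in the queue label.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a
  name: team-a-queue
spec:
  clusterQueue: batch-cq
---
# 5. Job (user): only the queue-name label ties it to Kueue.
apiVersion: batch/v1
kind: Job
metadata:
  namespace: team-a
  name: sample-job
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  parallelism: 2
  completions: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36
        command: ["sh", "-c", "sleep 30"]
        resources:
          requests:            # requests are what Kueue counts against quota
            cpu: "1"
            memory: 200Mi
```

The Job does not set `suspend` itself; as described under Core Philosophy, Kueue's webhook handles that.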
"My framework isn't being managed by Kueue"
- Check it's enabled in Kueue Configuration: `kubectl -n kueue-system get cm kueue-manager-config -o yaml | grep frameworks`
- Restart kueue-controller-manager after editing config: `kubectl -n kueue-system rollout restart deploy/kueue-controller-manager`
- For pod integration: the `pod` framework must be enabled AND `managedJobsNamespaceSelector` must match the Pod's namespace
- For Deployment/StatefulSet: requires the `pod` integration + label `kueue.x-k8s.io/queue-name` on the Pod template
-> See `references/job-integrations.md`
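For orientation, the relevant part of the Kueue Configuration might look like the fragment below. Treat it as a sketch: it assumes the `config.kueue.x-k8s.io/v1beta1` Configuration kind stored in the `kueue-manager-config` ConfigMap, the excluded namespaces are examples, and the exact set of framework names available depends on your Kueue version.

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# Only namespaces matching this selector have their Pods (and Pod-based
# workloads) managed by Kueue. The excluded namespaces here are examples.
managedJobsNamespaceSelector:
  matchExpressions:
  - key: kubernetes.io/metadata.name
    operator: NotIn
    values: ["kube-system", "kueue-system"]
integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/pytorchjob"
  - "ray.io/rayjob"
  - "pod"            # needed for plain Pods and for Deployment/StatefulSet support
  - "deployment"
  - "statefulset"
```

A framework missing from `integrations.frameworks` is ignored by Kueue even when its objects carry the queue-name label.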
"How do I let Cluster Autoscaler scale up GPU nodes for queued workloads?"
- Add an `AdmissionCheck` of kind `ProvisioningRequest` to the GPU ClusterQueue
- Create a `ProvisioningRequestConfig` referencing the autoscaler's provisioning class (`check-capacity.autoscaling.x-k8s.io`, `queued-provisioning.gke.io`, or Karpenter's class)
- Pending Workload triggers a ProvisioningRequest → autoscaler scales nodes → Workload admitted → Pods schedule
-> See `references/advanced-features.md`
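The steps above can be sketched as three objects. This is an illustrative example assuming the `kueue.x-k8s.io/v1beta1` API and the `check-capacity.autoscaling.x-k8s.io` class; the names `gpu-prov-config`, `gpu-provisioning`, `gpu-cq`, and `gpu-flavor` and the quota value are hypothetical.

```yaml
# Tells Kueue which provisioning class to request and for which resources.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: gpu-prov-config
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  managedResources:
  - nvidia.com/gpu
---
# AdmissionCheck handled by Kueue's ProvisioningRequest controller.
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: gpu-provisioning
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: gpu-prov-config
---
# The GPU ClusterQueue lists the check; Workloads wait on it after quota is reserved.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cq
spec:
  namespaceSelector: {}
  admissionChecks:
  - gpu-provisioning
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor         # assumes a ResourceFlavor with this name exists
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 16
```

While the check is pending, the Workload shows `QuotaReserved` without `Admitted`, which matches the troubleshooting hint earlier in this page.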
"I want fair sharing between teams"
- Enable in Configuration: `fairSharing.enable: true`
- Put team ClusterQueues in the same `cohort`
- Set `fairSharing.weight` per ClusterQueue (default 1)
- Choose `preemptionStrategies` (`LessThanOrEqualToFinalShare` for stable, `LessThanInitialShare` for aggressive)
-> See `references/advanced-features.md` and `references/quota-and-tenancy-patterns.md`
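A minimal sketch of both halves follows: the controller-level switch and a per-queue weight. It assumes the `v1beta1` Configuration and ClusterQueue APIs; the cohort `shared`, the queue `team-a-cq`, the flavor name, and the numbers are hypothetical. The first document is a fragment of the Kueue Configuration (it lives in the manager ConfigMap, not as a cluster object).

```yaml
# Kueue Configuration fragment: turn fair sharing on.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
fairSharing:
  enable: true
  preemptionStrategies:
  - LessThanOrEqualToFinalShare
  - LessThanInitialShare
---
# One team's ClusterQueue: same cohort as its peers, double weight.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: shared
  namespaceSelector: {}
  fairSharing:
    weight: 2                  # relative to peers, which default to 1
  preemption:
    reclaimWithinCohort: Any   # preemption must be enabled for fair-sharing preemption to act
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 20
```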
Decision Trees
Which Queueing Strategy?
| Signal | Strategy |
|---|---|
| Strict in-order admission (FIFO with head-of-line blocking acceptable) | StrictFIFO |
| Maximize throughput, allow later workloads to admit if head is blocked | BestEffortFIFO (default) |
| Need priority-based ordering | Add WorkloadPriorityClass to either |
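As a sketch of how these choices appear in manifests: `queueingStrategy` is a ClusterQueue field, and a WorkloadPriorityClass is attached to a Job through a label. The example assumes the `kueue.x-k8s.io/v1beta1` API; the names `ordered-cq`, `high-priority`, `team-a-queue`, and `urgent-job` and the priority value are hypothetical.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ordered-cq
spec:
  queueingStrategy: StrictFIFO   # omit the field to get BestEffortFIFO
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 10000                     # higher values are ordered first
description: "Urgent batch work"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: urgent-job
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
    kueue.x-k8s.io/priority-class: high-priority   # selects the WorkloadPriorityClass
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36
        command: ["sh", "-c", "echo done"]
        resources:
          requests:
            cpu: "1"
```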
Which Preemption Policy?
| Goal | withinClusterQueue | reclaimWithinCohort |
|---|---|---|
| No preemption (hard isolation) | Never | Never |
| Higher priority preempts lower | LowerPriority | LowerPriority |
| Reclaim borrowed capacity | (own choice) | Any |
| Most permissive (research clusters) | LowerOrNewerEqualPriority | Any |
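The two columns map to fields under `spec.preemption` on the ClusterQueue. The sketch below combines "higher priority preempts lower" inside the queue with reclaiming lent capacity from the cohort; it assumes the `kueue.x-k8s.io/v1beta1` API, and the queue, cohort, and flavor names are hypothetical.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-cq
spec:
  cohort: shared
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority Workloads in this queue
    reclaimWithinCohort: Any            # take back nominal quota that other queues borrowed
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 20
```

Leaving out the `preemption` block is equivalent to the "No preemption" row, since both fields default to `Never`.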
Flavor Fungibility Behavior
| whenCanBorrow | whenCanPreempt | Behavior |
|---|---|---|
| Borrow | TryNextFlavor | Try borrowing in current flavor before falling through |
| TryNextFlavor | TryNextFlavor | Always try next flavor first (e.g., spot before on-demand) |
| Borrow | Preempt | Aggressive: borrow or preempt within current flavor before fallback |
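A sketch of the spot-then-on-demand case from the table: two flavors listed in preference order, with `flavorFungibility` set to fall through rather than borrow or preempt. It assumes the `kueue.x-k8s.io/v1beta1` API; the node label `example.com/capacity-type`, the queue name, and the quotas are hypothetical.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot
spec:
  nodeLabels:
    example.com/capacity-type: spot       # hypothetical node label
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: ondemand
spec:
  nodeLabels:
    example.com/capacity-type: on-demand  # hypothetical node label
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: mixed-cq
spec:
  namespaceSelector: {}
  flavorFungibility:
    whenCanBorrow: TryNextFlavor    # move to the next flavor instead of borrowing
    whenCanPreempt: TryNextFlavor   # move to the next flavor instead of preempting
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:                        # flavors are tried in the order listed
    - name: spot
      resources:
      - name: cpu
        nominalQuota: 100
    - name: ondemand
      resources:
      - name: cpu
        nominalQuota: 20
```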
Single ClusterQueue or Many?
| Signal | Topology |
|---|---|
| Small team, one project | 1 ClusterQueue, no cohort |
| Multiple teams, want sharing | N ClusterQueues, same cohort, borrowingLimit set |
| Org → Department → Team hierarchy | Hierarchical Cohorts (parent/child) |
| Strict per-team isolation | N ClusterQueues, no cohort (or borrowingLimit: 0) |
| Multi-cluster federation | MultiKueue with management + worker clusters |
Reference Index
| File | Contents |
|---|---|
| references/concepts-and-architecture.md | Problem positioning, all CRDs (ResourceFlavor, ClusterQueue, LocalQueue, Workload, Cohort, AdmissionCheck, Topology, WorkloadPriorityClass), scheduling lifecycle (suspend → quota check → flavor assignment → admission check → unsuspend), controller architecture (reconcilers, in-memory cache, webhooks), resource model with borrowing/lending semantics, queueing strategies |
| references/job-integrations.md | Universal kueue.x-k8s.io/queue-name pattern, full integrations.frameworks enablement table, per-framework YAML for batch/v1 Job, JobSet, Kubeflow v1 (PyTorchJob/TFJob/MPIJob/XGBoostJob/PaddleJob/JAXJob), Kubeflow Trainer v2 (TrainJob), KubeRay (RayJob/RayCluster/RayService), AppWrapper, plain Pods (single + groups), Deployment/StatefulSet, LeaderWorkerSet, Spark, custom integrations |
| references/installation-and-config.md | kubectl apply / Helm OCI (oci://registry.k8s.io/kueue/charts/kueue) / Kustomize / GitOps install, full Configuration kind reference (manageJobsWithoutQueueName, managedJobsNamespaceSelector, integrations, multiKueue, fairSharing, waitForPodsReady, internalCertManagement, leaderElection), feature gates table by maturity, upgrade path, HA + cert-manager + ServiceMonitor production setup |
| references/advanced-features.md | Preemption (withinClusterQueue, reclaimWithinCohort, borrowWithinCohort), Fair Sharing (DRF, weights, preemption strategies, AdmissionFairSharing), MultiKueue (architecture, MultiKueueConfig/Cluster, dispatcherName, supported jobs), Topology-Aware Scheduling (Topology CRD, podset annotations, NCCL locality), ProvisioningRequest (Cluster Autoscaler + Karpenter integration), Hierarchical Cohorts, WorkloadPriorityClass, Partial Admission + Elastic Jobs (workload slices), waitForPodsReady gang scheduling, custom AdmissionCheckController pattern |
| references/operations-and-cli.md | Full kueuectl command surface (create/list/get/describe/stop/resume/delete/edit), Workload status interpretation (Pending/QuotaReserved/Admitted/Finished/Evicted + reasons), ClusterQueue status fields, Prometheus metrics catalog (kueue_pending_workloads, kueue_admission_attempt_duration_seconds, etc.), PromQL recipes, log diagnostics, troubleshooting trees for stuck/evicted/never-admitted workloads, performance tuning, drain/migrate procedures |
| references/quota-and-tenancy-patterns.md | Quota semantics (nominalQuota / borrowingLimit / lendingLimit / effectiveQuota), single-tenant pattern, multi-team patterns (equal-share with borrowing, tiered priority, reserved + shared pool, strict isolation), hierarchical cohort design, flavor patterns (spot+on-demand, GPU classes, cross-zone), preemption design, namespace selectors, RBAC, cost allocation, anti-patterns, safe migration playbook |
| references/core-batch-ai-recipes.md | End-to-end YAML recipes for basic batch queueing, multi-team GPU sharing with preemption, distributed PyTorch (Kubeflow), Ray hyperparameter tuning, spot+on-demand fallback, GPU autoscaling with ProvisioningRequest/Karpenter/Cluster Autoscaler, and MultiKueue federated dispatch |
| references/operations-services-migration-recipes.md | Production YAML recipes for MPI/HPC topology-aware scheduling, long-running Deployment quota, Argo/plain Pod queueing, AppWrapper gang admission, elastic jobs, CI/CD runner pools, online-vs-batch LLM inference, staged rollout, verification commands, and gotchas |