
@tank/kueue 1.0.0 (Skill)

Description

Kubernetes-native queueing and quota management for batch, AI/ML, and HPC workloads with Kueue. Covers ClusterQueue, LocalQueue, ResourceFlavor, Workload, cohorts, admission checks, supported integrations, installation, kueuectl, metrics, preemption, fair sharing, MultiKueue, topology-aware scheduling, ProvisioningRequest autoscaling, tenancy patterns, and production recipes.

Download
tank install @tank/kueue

Kueue

Core Philosophy

  1. Suspend first, admit later — Every supported Job is created in suspend: true state by Kueue's mutating webhook. The Workload object is the queueing unit; the Job only unsuspends after the Workload is admitted. Never bypass this by setting suspend: false manually.
  2. Quota lives on the ClusterQueue, naming lives on the LocalQueue — Users submit Jobs with a kueue.x-k8s.io/queue-name: <local-queue> label. The LocalQueue points to a ClusterQueue, which holds the actual quota. This separates tenant identity from capacity policy (the three objects are sketched after this list).
  3. Cohorts are how slack capacity is shared — Two ClusterQueues in the same cohort can borrow each other's unused nominalQuota up to their borrowingLimit. Without a cohort, quota is hard-isolated. Set lendingLimit to cap how much of a queue's unused quota its cohort peers may borrow.
  4. Flavors map workloads to hardware — A ResourceFlavor is a tuple of (nodeLabels, taints, tolerations). Quota is per-flavor. The same cpu resource can have separate quotas for spot and ondemand flavors, and flavorFungibility controls fall-through behavior.
  5. AdmissionChecks gate the final unsuspend — Quota reservation is necessary but not sufficient. AdmissionChecks (e.g., MultiKueue, ProvisioningRequest) must all pass Ready before the Workload transitions to Admitted and the Job unsuspends.
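
A minimal sketch of how these objects wire together, using illustrative names (spot, org-pool, team-a, team-a-ns) and placeholder quota values:

```yaml
# ResourceFlavor: maps quota to a class of nodes (point 4)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot
spec:
  nodeLabels:
    node-type: spot          # Workloads admitted under this flavor land on these nodes
---
# ClusterQueue: holds the quota (point 2) and joins a cohort (point 3)
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}      # accept Workloads from any namespace with a matching LocalQueue
  cohort: org-pool           # optional: enables borrowing between queues in org-pool
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: spot
          resources:
            - name: cpu
              nominalQuota: 100
            - name: memory
              nominalQuota: 256Gi
---
# LocalQueue: the name users put in the kueue.x-k8s.io/queue-name label (point 2)
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a-ns
spec:
  clusterQueue: team-a
```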

Quick-Start: Common Problems

"My Job is created but no Pods appear"

  1. Check the Job has the queue label (a correctly labeled Job is sketched after this list): kubectl get job <name> -o jsonpath='{.metadata.labels.kueue\.x-k8s\.io/queue-name}'
  2. Check the Workload exists: kubectl get workload -l kueue.x-k8s.io/job-name=<job>
  3. Read Workload conditions: kubectl describe workload <wl> — look at QuotaReserved, Admitted, Inadmissible reasons
  4. If Inadmissible: quota is full or no flavor matches the Pod's nodeSelector
  5. If QuotaReserved but not Admitted: an AdmissionCheck is pending -> See references/operations-and-cli.md
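
For comparison, a minimal correctly labeled Job (name, namespace, queue, and image are placeholders); suspend is set by Kueue's webhook, so you never set it yourself:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: demo-job
  namespace: team-a-ns
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # the label checked in step 1
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox:1.36                 # placeholder image
          command: ["sleep", "30"]
          resources:
            requests:                         # requests drive flavor matching and quota use
              cpu: "1"
              memory: 256Mi
```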

"Which Kueue CRD do I create first?"

| Order | CRD | Created by |
|---|---|---|
| 1 | ResourceFlavor | Cluster admin |
| 2 | ClusterQueue (references flavors, sets quota, optional cohort) | Cluster admin |
| 3 | LocalQueue in tenant namespace (points to ClusterQueue) | Namespace admin |
| 4 | WorkloadPriorityClass (optional; sketched below) | Cluster admin |
| 5 | Job/RayJob/PyTorchJob with kueue.x-k8s.io/queue-name label | User |
-> See references/concepts-and-architecture.md
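
Step 4's WorkloadPriorityClass is a small cluster-scoped object; a sketch with an illustrative name and value:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: high-priority
value: 1000                  # higher values are admitted first and can preempt lower ones
description: "For latency-sensitive training jobs"
```

Jobs opt in with the kueue.x-k8s.io/priority-class: high-priority label; unlike a pod PriorityClass, this affects only queueing and preemption, not pod scheduling.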

"My framework isn't being managed by Kueue"

  1. Check it's enabled in Kueue Configuration: kubectl -n kueue-system get cm kueue-manager-config -o yaml | grep frameworks
  2. Restart kueue-controller-manager after editing config: kubectl -n kueue-system rollout restart deploy/kueue-controller-manager
  3. For pod integration: pod framework must be enabled AND managedJobsNamespaceSelector must match the Pod's namespace
  4. For Deployment/StatefulSet: requires the pod integration + label kueue.x-k8s.io/queue-name on the Pod template -> See references/job-integrations.md
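
A sketch of the relevant fields inside the Configuration embedded in the kueue-manager-config ConfigMap (the framework list and namespace label are examples):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
    - "batch/job"
    - "kubeflow.org/pytorchjob"
    - "ray.io/rayjob"
    - "pod"                        # needed for Deployment/StatefulSet (step 4)
managedJobsNamespaceSelector:      # Pods are only managed in matching namespaces (step 3)
  matchLabels:
    kueue-managed: "true"
```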

"How do I let Cluster Autoscaler scale up GPU nodes for queued workloads?"

  1. Add an AdmissionCheck of kind ProvisioningRequest to the GPU ClusterQueue
  2. Create a ProvisioningRequestConfig referencing the autoscaler's provisioning class (check-capacity.autoscaling.x-k8s.io, queued-provisioning.gke.io, or Karpenter's class)
  3. Pending Workload triggers a ProvisioningRequest → autoscaler scales nodes → Workload admitted → Pods schedule -> See references/advanced-features.md
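
A sketch of the two objects from steps 1-2, assuming the Cluster Autoscaler's check-capacity class; names are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: gpu-capacity
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  managedResources:
    - nvidia.com/gpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: gpu-provisioning           # add this name to the ClusterQueue's spec.admissionChecks
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: gpu-capacity
```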

"I want fair sharing between teams"

  1. Enable in Configuration: fairSharing.enable: true
  2. Put team ClusterQueues in the same cohort
  3. Set fairSharing.weight per ClusterQueue (default 1)
  4. Choose preemptionStrategies (LessThanOrEqualToFinalShare for stable, LessThanInitialShare for aggressive) -> See references/advanced-features.md and references/quota-and-tenancy-patterns.md
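
Sketched in the two places this is configured (weight and quota values are examples):

```yaml
# Steps 1 and 4: in the Kueue Configuration
fairSharing:
  enable: true
  preemptionStrategies: ["LessThanOrEqualToFinalShare", "LessThanInitialShare"]
---
# Steps 2 and 3: on each team's ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b
spec:
  cohort: org-pool                 # same cohort as the other teams
  namespaceSelector: {}
  fairSharing:
    weight: 2                      # team-b receives twice the share of a weight-1 queue
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: spot
          resources:
            - name: cpu
              nominalQuota: 50
```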

Decision Trees

Which Queueing Strategy?

| Signal | Strategy |
|---|---|
| Strict in-order admission (FIFO with head-of-line blocking acceptable) | StrictFIFO |
| Maximize throughput, allow later workloads to admit if head is blocked | BestEffortFIFO (default) |
| Need priority-based ordering | Add WorkloadPriorityClass to either |
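
The strategy is a per-ClusterQueue field; a fragment:

```yaml
# On the ClusterQueue:
spec:
  queueingStrategy: StrictFIFO     # omit the field for the default, BestEffortFIFO
```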

Which Preemption Policy?

| Goal | withinClusterQueue | reclaimWithinCohort |
|---|---|---|
| No preemption (hard isolation) | Never | Never |
| Higher priority preempts lower | LowerPriority | LowerPriority |
| Reclaim borrowed capacity | (own choice) | Any |
| Anything goes (research clusters) | Any | Any |
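
These map onto the ClusterQueue's preemption stanza; for example, "higher priority preempts lower" combined with cohort reclaim:

```yaml
# On the ClusterQueue:
spec:
  preemption:
    withinClusterQueue: LowerPriority   # evict lower-priority Workloads in this queue
    reclaimWithinCohort: Any            # take back nominal quota lent to cohort peers
```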

Flavor Fungibility Behavior

| whenCanBorrow | whenCanPreempt | Behavior |
|---|---|---|
| Borrow | TryNextFlavor | Try borrowing in the current flavor before falling through |
| TryNextFlavor | TryNextFlavor | Always try the next flavor first (e.g., spot before on-demand) |
| Borrow | Preempt | Aggressive: borrow or preempt within the current flavor before fallback |
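
Also a per-ClusterQueue stanza; the spot-before-on-demand row from the table would be:

```yaml
# On the ClusterQueue (defaults: whenCanBorrow: Borrow, whenCanPreempt: TryNextFlavor):
spec:
  flavorFungibility:
    whenCanBorrow: TryNextFlavor   # try the next flavor rather than borrowing here
    whenCanPreempt: TryNextFlavor  # try the next flavor rather than preempting here
```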

Single ClusterQueue or Many?

| Signal | Topology |
|---|---|
| Small team, one project | 1 ClusterQueue, no cohort |
| Multiple teams, want sharing | N ClusterQueues, same cohort, borrowingLimit set |
| Org → Department → Team hierarchy | Hierarchical Cohorts (parent/child) |
| Strict per-team isolation | N ClusterQueues, no cohort (or borrowingLimit: 0) |
| Multi-cluster federation | MultiKueue with management + worker clusters |
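
For the "same cohort, borrowingLimit set" row, one team's queue might look like this (continuing the illustrative names from earlier; note lendingLimit sits behind the LendingLimit feature gate on older Kueue releases):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohort: org-pool
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: spot
          resources:
            - name: cpu
              nominalQuota: 50
              borrowingLimit: 25   # may use up to 25 CPUs of cohort slack beyond nominal
              lendingLimit: 40     # peers may borrow at most 40 of this queue's 50
```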

Reference Index

| File | Contents |
|---|---|
| references/concepts-and-architecture.md | Problem positioning, all CRDs (ResourceFlavor, ClusterQueue, LocalQueue, Workload, Cohort, AdmissionCheck, Topology, WorkloadPriorityClass), scheduling lifecycle (suspend → quota check → flavor assignment → admission check → unsuspend), controller architecture (reconcilers, in-memory cache, webhooks), resource model with borrowing/lending semantics, queueing strategies |
| references/job-integrations.md | Universal kueue.x-k8s.io/queue-name pattern, full integrations.frameworks enablement table, per-framework YAML for batch/v1 Job, JobSet, Kubeflow v1 (PyTorchJob/TFJob/MPIJob/XGBoostJob/PaddleJob/JAXJob), Kubeflow Trainer v2 (TrainJob), KubeRay (RayJob/RayCluster/RayService), AppWrapper, plain Pods (single + groups), Deployment/StatefulSet, LeaderWorkerSet, Spark, custom integrations |
| references/installation-and-config.md | kubectl apply / Helm OCI (oci://registry.k8s.io/kueue/charts/kueue) / Kustomize / GitOps install, full Configuration kind reference (manageJobsWithoutQueueName, managedJobsNamespaceSelector, integrations, multiKueue, fairSharing, waitForPodsReady, internalCertManagement, leaderElection), feature gates table by maturity, upgrade path, HA + cert-manager + ServiceMonitor production setup |
| references/advanced-features.md | Preemption (withinClusterQueue, reclaimWithinCohort, borrowWithinCohort), Fair Sharing (DRF, weights, preemption strategies, AdmissionFairSharing), MultiKueue (architecture, MultiKueueConfig/Cluster, dispatcherName, supported jobs), Topology-Aware Scheduling (Topology CRD, podset annotations, NCCL locality), ProvisioningRequest (Cluster Autoscaler + Karpenter integration), Hierarchical Cohorts, WorkloadPriorityClass, Partial Admission + Elastic Jobs (workload slices), waitForPodsReady gang scheduling, custom AdmissionCheckController pattern |
| references/operations-and-cli.md | Full kueuectl command surface (create/list/get/describe/stop/resume/delete/edit), Workload status interpretation (Pending/QuotaReserved/Admitted/Finished/Evicted + reasons), ClusterQueue status fields, Prometheus metrics catalog (kueue_pending_workloads, kueue_admission_attempt_duration_seconds, etc.), PromQL recipes, log diagnostics, troubleshooting trees for stuck/evicted/never-admitted workloads, performance tuning, drain/migrate procedures |
| references/quota-and-tenancy-patterns.md | Quota semantics (nominalQuota / borrowingLimit / lendingLimit / effectiveQuota), single-tenant pattern, multi-team patterns (equal-share with borrowing, tiered priority, reserved + shared pool, strict isolation), hierarchical cohort design, flavor patterns (spot+on-demand, GPU classes, cross-zone), preemption design, namespace selectors, RBAC, cost allocation, anti-patterns, safe migration playbook |
| references/core-batch-ai-recipes.md | End-to-end YAML recipes for basic batch queueing, multi-team GPU sharing with preemption, distributed PyTorch (Kubeflow), Ray hyperparameter tuning, spot+on-demand fallback, GPU autoscaling with ProvisioningRequest/Karpenter/Cluster Autoscaler, and MultiKueue federated dispatch |
| references/operations-services-migration-recipes.md | Production YAML recipes for MPI/HPC topology-aware scheduling, long-running Deployment quota, Argo/plain Pod queueing, AppWrapper gang admission, elastic jobs, CI/CD runner pools, online-vs-batch LLM inference, staged rollout, verification commands, and gotchas |
