Kubernetes for GPU Orchestration: Managing Multi-Thousand GPU Clusters
Updated December 8, 2025
December 2025 Update: Dynamic Resource Allocation (DRA) has reached general availability in recent Kubernetes releases, enabling fine-grained GPU allocation and device sharing. NVIDIA GPU Operator 24.6+ adds Blackwell support and improved MIG management. Kueue (Kubernetes-native job queueing) is reaching production maturity for AI workloads. Run:ai and CoreWeave have demonstrated 50,000+ GPU clusters on Kubernetes. Multi-cluster federation tools (Liqo, Admiralty) enable cross-cloud GPU orchestration. vGPU support continues to improve for multi-tenant inference deployments.
OpenAI orchestrates 25,000 GPUs across multiple Kubernetes clusters to train GPT models, using custom operators that automatically handle GPU failures, rebalance workloads in real-time, and maintain 97% utilization despite hardware failures occurring every 2.5 hours on average.¹ The company's infrastructure team discovered that vanilla Kubernetes collapses at around 5,000 nodes without extensive modifications, forcing them to implement hierarchical cluster federation, custom scheduling algorithms, and GPU-aware autoscaling that treats each $30,000 H100 as a precious resource requiring individual tracking. Managing GPUs at scale differs fundamentally from CPU orchestration—a failed GPU during distributed training can waste millions in compute time, while poor scheduling that separates GPUs connected via NVLink causes 8x performance degradation. The organizations successfully orchestrating thousands of GPUs on Kubernetes report 35% better utilization, 60% faster deployment times, and 90% reduction in operational overhead compared to bare-metal management.²
Kubernetes dominates container orchestration with 88% market share, but GPU support arrived late and remains challenging at scale.³ The NVIDIA GPU Operator, launched in 2019, finally brought enterprise-grade GPU management to Kubernetes, enabling features like dynamic driver installation, automatic device plugin deployment, and GPU health monitoring. Organizations running AI workloads on Kubernetes must navigate device plugin configurations, node affinity rules, topology-aware scheduling, and resource quotas that prevent single teams from monopolizing GPU resources. Yet those who master Kubernetes for GPU orchestration gain the ability to treat thousands of GPUs as a single programmable resource pool, achieving utilization rates and operational efficiency impossible with traditional HPC schedulers.
GPU device plugin architecture
The Kubernetes device plugin framework enables GPU discovery, allocation, and health monitoring across clusters:
NVIDIA GPU Device Plugin serves as the primary interface between Kubernetes and NVIDIA GPUs.⁴ The plugin runs as a DaemonSet on every GPU node, registering GPUs as schedulable resources with the kubelet. During initialization, the plugin queries NVIDIA Management Library (NVML) to discover available GPUs, their memory capacity, compute capability, and interconnect topology. The plugin advertises GPUs to the Kubernetes scheduler using the nvidia.com/gpu resource name, enabling pods to request GPUs through standard resource specifications.
Device Plugin Registration Flow:
1. Plugin starts and discovers local GPUs via NVML
2. Registers with kubelet through Unix socket at /var/lib/kubelet/device-plugins/
3. Advertises available GPUs with unique device IDs
4. Implements Allocate() RPC for container GPU assignment
5. Monitors GPU health and reports failures to kubelet
6. Handles GPU cleanup during pod termination
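To make the request path concrete, here is a minimal sketch that builds a pod manifest asking for one GPU through the nvidia.com/gpu resource name the plugin advertises. The pod name, namespace, and container image are placeholders; the output can be piped to kubectl.

```python
# A minimal sketch (not NVIDIA's reference example): build a pod manifest
# that requests one GPU via the nvidia.com/gpu resource name advertised by
# the device plugin. Pod name, namespace, and image are placeholders.
import yaml

gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "cuda-smoke-test", "namespace": "ml-dev"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda",
            "image": "nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04",  # placeholder tag
            "command": ["nvidia-smi"],
            # GPUs are extended resources: request them as limits (requests
            # implicitly equal limits for extended resources).
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

print(yaml.safe_dump(gpu_pod, sort_keys=False))  # pipe into `kubectl apply -f -`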
Multi-Instance GPU (MIG) Support enables partitioning A100 and H100 GPUs into isolated instances.⁵ Each MIG instance appears as a separate GPU to Kubernetes, allowing fine-grained resource allocation. The device plugin manages MIG profiles, handling creation, deletion, and assignment of instances. Organizations achieve 7x better GPU utilization by partitioning underutilized GPUs for smaller workloads. MIG configuration requires careful planning as partitioning modes cannot change without draining nodes.
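As a hedged illustration, the sketch below requests a single 3g.20gb slice under the device plugin's mixed MIG strategy, where every profile surfaces as its own extended resource. The profile name and container image are assumptions tied to A100-class hardware and will differ with your GPU model and operator configuration.

```python
import yaml

# Sketch: under the device plugin's "mixed" MIG strategy, each profile is
# exposed as its own extended resource name. The 3g.20gb profile and the
# container image below are assumptions.
mig_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "mig-worker"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "worker",
            "image": "nvcr.io/nvidia/pytorch:24.08-py3",  # placeholder
            "resources": {"limits": {"nvidia.com/mig-3g.20gb": 1}},
        }],
    },
}
print(yaml.safe_dump(mig_pod, sort_keys=False))
```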
Alternative Device Plugins provide vendor diversity:
- AMD Device Plugin supports ROCm-enabled GPUs like MI250X
- Intel Device Plugin manages Intel GPUs and Gaudi accelerators
- Xilinx FPGA Device Plugin orchestrates FPGA resources
- Google TPU Device Plugin for GKE clusters
Scheduling strategies for GPU workloads
Effective GPU scheduling maximizes utilization while maintaining performance:
Gang Scheduling ensures distributed training jobs receive all requested GPUs simultaneously. Without gang scheduling, partial resource allocation causes deadlock: jobs wait forever for remaining GPUs that never become available. Batch schedulers such as Volcano and Apache YuniKorn implement gang scheduling through PodGroups.⁶ Jobs specify minimum GPU requirements, and the scheduler either allocates all resources or queues the entire job. Gang scheduling typically costs 10-15% of cluster utilization but prevents training jobs from starving.
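The sketch below shows what a Volcano PodGroup for an eight-worker training job might look like. The names, namespace, queue, and GPU counts are placeholders, and the field names follow the scheduling.volcano.sh/v1beta1 API as documented by Volcano.

```python
import yaml

# Hedged sketch of a Volcano PodGroup: all eight workers are admitted
# together or the job waits in the queue. Names and the "research" queue
# are placeholders.
pod_group = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "llm-pretrain-pg", "namespace": "training"},
    "spec": {
        "minMember": 8,                             # gang size: all or nothing
        "minResources": {"nvidia.com/gpu": "64"},   # 8 workers x 8 GPUs each
        "queue": "research",                        # placeholder queue
    },
}
print(yaml.safe_dump(pod_group, sort_keys=False))
# Worker pods join the gang by setting spec.schedulerName: volcano and the
# annotation scheduling.k8s.io/group-name: llm-pretrain-pg.
```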
Topology-Aware Scheduling optimizes GPU placement based on hardware interconnects. GPUs connected via NVLink achieve 600GB/s bandwidth versus 32GB/s over PCIe.⁷ The scheduler examines node topology to place related pods on GPUs with fast interconnects. NVIDIA GPU Feature Discovery labels nodes with topology information enabling affinity rules. Poor topology decisions cause 3-8x performance degradation for communication-heavy workloads. Topology awareness becomes critical beyond 8 GPUs per job.
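As one possible expression of such affinity rules, the sketch below pins a job to nodes whose GPU Feature Discovery labels advertise a specific GPU product and at least eight devices. The label values are examples; confirm what your GFD version actually publishes before relying on them.

```python
import yaml

# Sketch: steer an 8-GPU pod onto nodes whose GPU Feature Discovery labels
# advertise a specific product and at least eight devices. Label keys and
# values are examples; check `kubectl get node <name> --show-labels`.
node_affinity = {
    "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [{
            "matchExpressions": [
                {"key": "nvidia.com/gpu.product", "operator": "In",
                 "values": ["NVIDIA-H100-80GB-HBM3"]},  # example product label
                {"key": "nvidia.com/gpu.count", "operator": "Gt",
                 "values": ["7"]},                       # i.e. 8 or more GPUs
            ],
        }],
    },
}
print(yaml.safe_dump({"spec": {"affinity": {"nodeAffinity": node_affinity}}},
                     sort_keys=False))
```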
Bin Packing vs Spreading: Bin packing consolidates workloads on fewer nodes, improving cache locality and reducing network traffic. Spreading distributes work across nodes for better fault tolerance and thermal management. The optimal strategy depends on workload characteristics—training jobs benefit from bin packing while inference favors spreading. Dynamic strategies adjust based on cluster utilization and workload mix.
Priority and Preemption: Production workloads receive higher priority than development jobs. The scheduler preempts lower-priority pods when resources become scarce. Careful priority configuration prevents research jobs from blocking production inference. Preemption checkpointing ensures training progress isn't lost. Priority classes range from system-critical (1000000) to development (100).
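A minimal sketch of the two ends of that priority range, using class names of our own choosing; pods opt in through spec.priorityClassName.

```python
import yaml

# Sketch: the two ends of the priority range described above. The class
# names are placeholders; values mirror the ranges in the text.
priority_classes = [
    {"apiVersion": "scheduling.k8s.io/v1", "kind": "PriorityClass",
     "metadata": {"name": "prod-inference"},
     "value": 1000000,
     "preemptionPolicy": "PreemptLowerPriority",
     "description": "Production serving; may preempt development workloads."},
    {"apiVersion": "scheduling.k8s.io/v1", "kind": "PriorityClass",
     "metadata": {"name": "dev-experiments"},
     "value": 100,
     "preemptionPolicy": "Never",
     "description": "Development jobs; never preempt other pods."},
]
print(yaml.safe_dump_all(priority_classes, sort_keys=False))
```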
Fair Sharing and Quotas: Resource quotas prevent single teams from monopolizing GPUs. Hierarchical quotas enable organization-wide limits with team-specific sub-quotas. Fair share scheduling ensures equitable resource distribution over time. Users who consume fewer resources accumulate credits for future burst capacity. Queue systems like Kueue provide job queueing with sophisticated admission control.
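For the quota piece, extended resources can be capped per namespace with a standard ResourceQuota. The sketch below, with a placeholder team namespace and limit, caps one team at 64 GPUs.

```python
import yaml

# Sketch: cap a team namespace at 64 GPUs. Extended resources are quota'd
# with the requests.<resource-name> key; namespace and limit are placeholders.
gpu_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "gpu-quota", "namespace": "team-nlp"},
    "spec": {"hard": {"requests.nvidia.com/gpu": "64"}},
}
print(yaml.safe_dump(gpu_quota, sort_keys=False))
```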
Scaling considerations
Scaling Kubernetes to thousands of GPUs requires architectural modifications:
Cluster Size Limitations: Single Kubernetes clusters support 5,000 nodes maximum before etcd performance degrades.⁸ API server load increases quadratically with node count due to watch mechanisms. Controller manager reconciliation loops slow beyond 1,000 nodes. Network policies become unwieldy at scale. Most organizations limit clusters to 500-1,000 GPU nodes for operational stability.
Multi-Cluster Federation: Large deployments use multiple Kubernetes clusters managed through federation. Admiralty or Virtual Kubelet enable cross-cluster scheduling. GitOps tools like Flux or ArgoCD synchronize configurations across clusters. Service mesh technologies provide cross-cluster networking. Federation adds complexity but enables horizontal scaling beyond single-cluster limits.
Hierarchical Resource Management: Organize clusters hierarchically with management clusters controlling workload clusters. Management clusters run control plane components and scheduling logic. Workload clusters host GPU pods without complex controllers. Hierarchical design reduces blast radius of failures. Clear separation of concerns simplifies troubleshooting.
Custom Resource Definitions (CRDs) for AI workloads:
- TrainingJob: Defines distributed training specifications
- InferenceService: Manages model serving deployments
- GPUPool: Represents logical GPU groupings
- Checkpoint: Handles training state persistence
- ModelVersion: Tracks model iterations and lineage
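The sketch below shows what a minimal CRD for the TrainingJob resource from the list above could look like. It is illustrative only: the ai.example.com API group and every field are hypothetical, and production platforms usually install equivalent CRDs from existing operators rather than hand-rolling them.

```python
import yaml

# Illustrative only: a minimal CustomResourceDefinition for a hypothetical
# TrainingJob resource. The group, fields, and short name are made up.
training_job_crd = {
    "apiVersion": "apiextensions.k8s.io/v1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "trainingjobs.ai.example.com"},
    "spec": {
        "group": "ai.example.com",
        "scope": "Namespaced",
        "names": {"plural": "trainingjobs", "singular": "trainingjob",
                  "kind": "TrainingJob", "shortNames": ["tj"]},
        "versions": [{
            "name": "v1alpha1", "served": True, "storage": True,
            "schema": {"openAPIV3Schema": {
                "type": "object",
                "properties": {"spec": {
                    "type": "object",
                    "properties": {
                        "workers": {"type": "integer"},
                        "gpusPerWorker": {"type": "integer"},
                        "image": {"type": "string"},
                        "checkpointPVC": {"type": "string"},
                    },
                }},
            }},
        }],
    },
}
print(yaml.safe_dump(training_job_crd, sort_keys=False))
```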
Performance optimizations for scale:
- Disable unused admission webhooks to reduce API latency
- Implement pod topology spread constraints for even distribution (see the sketch after this list)
- Use local SSD for container images to avoid network bottlenecks
- Enable the CPU manager for guaranteed CPU allocation
- Configure huge pages for large model memory requirements
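A minimal sketch of the topology-spread item above: it keeps replicas of one workload evenly distributed across GPU nodes. The app label is a placeholder.

```python
import yaml

# Sketch: keep inference replicas evenly spread across GPU nodes.
# The app label value is a placeholder.
spread_constraints = [{
    "maxSkew": 1,
    "topologyKey": "kubernetes.io/hostname",
    "whenUnsatisfiable": "ScheduleAnyway",
    "labelSelector": {"matchLabels": {"app": "llm-inference"}},
}]
print(yaml.safe_dump({"spec": {"topologySpreadConstraints": spread_constraints}},
                     sort_keys=False))
```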
Monitoring and observability
Comprehensive monitoring prevents million-dollar GPU idle time:
NVIDIA Data Center GPU Manager (DCGM) provides GPU-specific metrics unavailable through standard Kubernetes monitoring.⁹ DCGM exports 100+ metrics including SM utilization, memory bandwidth, temperature, power draw, and ECC errors. Prometheus integration enables long-term metric storage and alerting. Grafana dashboards visualize GPU performance across the entire fleet. Custom alerts detect underutilized GPUs, thermal throttling, and hardware degradation before failures occur.
Key GPU Metrics for Kubernetes monitoring:
- GPU Utilization: Percentage of SMs active (target >90%)
- Memory Utilization: GPU memory allocated versus available
- Power Draw: Actual versus TDP indicating throttling
- Temperature: GPU and memory temperatures
- PCIe Bandwidth: Data transfer rates to/from GPU
- NVLink Traffic: Inter-GPU communication bandwidth
- Training Metrics: Loss curves, gradient norms, learning rates
- Inference Metrics: Requests per second, P99 latency, batch sizes
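To act on the utilization target in the list above, a small script can query Prometheus for the dcgm-exporter metric and flag lagging GPUs. This is a sketch: the Prometheus address and the Hostname/gpu label names are assumptions to adapt to your exporter configuration.

```python
# Sketch: flag GPUs averaging below the 90% SM-utilization target by
# querying Prometheus for the dcgm-exporter metric. The Prometheus address
# and label names are assumptions.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # placeholder address

def underutilized_gpus(threshold: float = 90.0):
    query = f"avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < {threshold}"
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        _, value = series["value"]
        yield labels.get("Hostname", "?"), labels.get("gpu", "?"), float(value)

if __name__ == "__main__":
    for node, gpu, util in underutilized_gpus():
        print(f"{node} GPU {gpu}: {util:.1f}% avg utilization over 30m")
```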
Distributed Tracing tracks requests across multiple GPUs and nodes. OpenTelemetry instrumentation captures training step times, data loading latency, and checkpoint durations. Jaeger or Tempo provides distributed trace storage and analysis. Correlating traces with metrics identifies performance bottlenecks. End-to-end visibility reduces mean time to resolution by 70%.
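A minimal sketch of that instrumentation: spans wrap each training phase so step time, data loading, and checkpointing show up in Jaeger or Tempo. The console exporter stands in for OTLP wiring, which is deployment-specific.

```python
# Sketch: wrap training phases in OpenTelemetry spans. The console exporter
# is a stand-in; production setups export via OTLP to Jaeger/Tempo.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("training")

def train_step(step: int) -> None:
    with tracer.start_as_current_span("train_step") as span:
        span.set_attribute("step", step)
        with tracer.start_as_current_span("data_loading"):
            pass  # fetch and preprocess the batch here
        with tracer.start_as_current_span("forward_backward"):
            pass  # model compute here

for i in range(3):
    train_step(i)
```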
Log Aggregation centralizes logs from thousands of GPU pods. Fluentd or Fluent Bit collect container logs with minimal overhead. Elasticsearch stores logs with automatic indexing and retention policies. Kibana enables log searching and analysis across the entire cluster. Structured logging with consistent formats simplifies troubleshooting. Alert on error patterns indicating systemic issues.
Introl deploys and manages Kubernetes clusters for GPU orchestration across our global coverage area, with expertise scaling to 10,000+ GPU deployments.¹⁰ Our platform engineering teams have implemented custom operators and scheduling enhancements for optimal GPU utilization.
Production deployment patterns
Training Cluster Architecture at Anthropic:
- Scale: 4,000 GPUs across 8 Kubernetes clusters
- Topology: Hierarchical federation with central scheduler
- Storage: Distributed Lustre filesystem for training data
- Networking: RoCE v2 fabric with 200Gbps per node
- Scheduling: Custom gang scheduler with topology awareness
- Monitoring: DCGM + Prometheus with 15-second scrape intervals
- Result: 94% GPU utilization, 50% reduction in training time

Inference Platform at Uber:
- Workload: 500 million predictions daily
- Infrastructure: 2,000 T4 GPUs across 20 regions
- Orchestration: Kubernetes with Knative for serverless
- Autoscaling: Predictive scaling based on traffic patterns
- Load Balancing: Envoy proxy with least-latency routing
- Optimization: Model caching reduces cold start to 2 seconds
- Outcome: 65% cost reduction versus previous architecture

Hybrid Training-Inference at Spotify:
- GPUs: 3,000 mixed V100/A100/T4 fleet
- Scheduling: Time-sliced sharing for development
- Isolation: Kata containers for multi-tenant security
- Cost: Spot instances for fault-tolerant training
- Storage: JuiceFS for shared model artifacts
- Result: 80% GPU utilization with 15-minute average queue time
Advanced Kubernetes features for AI
Time-Slicing and GPU Sharing: The NVIDIA GPU Operator enables time-slicing, allowing multiple pods to share a single GPU.¹¹ Each physical GPU is advertised as several schedulable replicas, and co-located workloads share compute in round-robin fashion. Time-slicing works well for inference but degrades training performance, and because it provides no memory isolation, oversubscription requires careful capacity planning. Organizations achieve 3-4x better utilization for development workloads through sharing.
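A hedged sketch of the device plugin's sharing configuration, which advertises each physical GPU as four schedulable replicas. The ConfigMap name, the "any" data key, and how the ConfigMap is referenced from the GPU Operator's ClusterPolicy vary by operator version.

```python
import yaml

# Hedged sketch of a time-slicing config: each physical GPU is advertised
# as four replicas. ConfigMap naming and wiring into the operator vary.
sharing_config = {
    "version": "v1",
    "sharing": {"timeSlicing": {"resources": [
        {"name": "nvidia.com/gpu", "replicas": 4},
    ]}},
}
config_map = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "time-slicing-config", "namespace": "gpu-operator"},
    "data": {"any": yaml.safe_dump(sharing_config, sort_keys=False)},
}
print(yaml.safe_dump(config_map, sort_keys=False))
```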
Dynamic Resource Allocation: DRA, introduced as an alpha feature in earlier releases and now generally available (see the December 2025 update above), changes how GPUs are scheduled.¹² Devices are allocated through resource claims at pod scheduling time rather than advertised as static counts at node startup, and fine-grained GPU attributes such as memory size become scheduling parameters. DRA also allows devices to be shared or reassigned between claims without draining nodes. Early adopters report 25% better GPU utilization with DRA.
Vertical Pod Autoscaling: VPA automatically adjusts resource requests based on historical usage, shrinking requests for under-utilized pods and growing them for over-subscribed pods when capacity allows. Upstream VPA targets CPU and memory, so GPU right-sizing typically pairs VPA recommendations with platform-specific tooling. VPA requires careful configuration to avoid disrupting training; recommendation mode suggests optimal resource requests without applying them automatically.
Job Suspension and Resumption: The Job API's suspend field, stable since Kubernetes 1.24, enables better resource management.¹³ Training jobs can be suspended during maintenance windows or resource contention, with state checkpointed to persistent volumes before suspension. Jobs resume automatically when resources become available. Suspension enables 20% better cluster utilization through improved scheduling flexibility.
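A minimal sketch of a suspended training Job; an operator or queue controller flips .spec.suspend to false when GPUs free up. The image, names, and sizes are placeholders.

```python
import yaml

# Sketch: a training Job created suspended, released later by patching
# .spec.suspend to false. Image and resource sizes are placeholders.
suspended_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "resnet-finetune", "namespace": "training"},
    "spec": {
        "suspend": True,          # held until explicitly released
        "backoffLimit": 2,
        "template": {"spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": "registry.example.com/trainer:latest",  # placeholder
                "resources": {"limits": {"nvidia.com/gpu": 8}},
            }],
        }},
    },
}
print(yaml.safe_dump(suspended_job, sort_keys=False))
# Resume with: kubectl -n training patch job resnet-finetune \
#   --type=merge -p '{"spec":{"suspend":false}}'
```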
Troubleshooting common issues
GPU orchestration on Kubernetes presents unique challenges:
Problem: Pods stuck in Pending state despite available GPUs
- Cause: Node selectors or affinity rules preventing scheduling
- Solution: Check pod anti-affinity rules and topology constraints
- Prevention: Implement admission webhooks validating scheduling rules

Problem: GPU memory errors during model training
- Cause: Memory fragmentation or insufficient GPU memory
- Solution: Restart pods to clear memory, adjust batch sizes
- Prevention: Set memory limits 10% below physical capacity

Problem: Distributed training jobs failing at scale
- Cause: Network timeouts or NCCL communication errors
- Solution: Increase NCCL timeout values, check network connectivity (see the sketch after this list)
- Prevention: Implement health checks and automatic pod recovery

Problem: Poor GPU utilization despite full allocation
- Cause: CPU bottlenecks or slow data loading
- Solution: Profile workloads, optimize data pipeline
- Prevention: Monitor CPU and I/O metrics alongside GPU metrics

Problem: Cluster upgrade breaking GPU functionality
- Cause: Driver version incompatibility
- Solution: Use NVIDIA GPU Operator for automated driver management
- Prevention: Test upgrades in staging environment first
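For the NCCL entry above, diagnostics usually start with environment variables on the worker containers. The sketch below is a hedged example: the values are illustrative starting points, not tuned recommendations, and the image is a placeholder.

```python
import yaml

# Sketch: NCCL diagnostic and timeout environment variables on a worker
# container. NCCL_DEBUG and NCCL_IB_TIMEOUT are standard NCCL variables;
# the values shown are starting points only.
worker_container = {
    "name": "trainer",
    "image": "registry.example.com/trainer:latest",  # placeholder
    "env": [
        {"name": "NCCL_DEBUG", "value": "INFO"},     # verbose NCCL logging
        {"name": "NCCL_IB_TIMEOUT", "value": "22"},  # raise the InfiniBand timeout
    ],
    "resources": {"limits": {"nvidia.com/gpu": 8}},
}
print(yaml.safe_dump(worker_container, sort_keys=False))
```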
Cost optimization techniques
Managing thousands of GPUs requires aggressive cost optimization:
Workload Consolidation: Pack multiple small jobs onto single GPUs using MIG or time-slicing. Implement queue systems that prevent GPUs from sitting idle between jobs. Batch inference requests for better throughput. Consolidation typically improves utilization from 60% to 85%.
Spot Instance Integration: Run fault-tolerant workloads on spot instances, which are typically discounted around 70% versus on-demand. Implement checkpointing for training job recovery. Use mixed instance pools for availability. The cluster autoscaler provisions spot instances automatically. After accounting for interruptions and checkpoint overhead, spot capacity reduces training costs by 50-60%.
Idle Detection and Reclamation: Automatically detect and reclaim idle GPUs after 15-minute timeout. Move idle resources to shared pools for development use. Send alerts for GPUs idle over 1 hour. Implement force-termination for abandoned jobs. Idle reclamation recovers 15-20% of capacity.
Chargeback and Showback: Track GPU usage by team, project, and user. Implement chargeback models for resource consumption. Provide showback reports highlighting optimization opportunities. Usage visibility reduces consumption by 25% through behavioral change.
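A rough showback sketch: pull average GPU requests per namespace from Prometheus and convert them to monthly GPU-hours and an indicative cost. The metric and label naming assume kube-state-metrics (which sanitizes the resource name to "nvidia_com_gpu"); the Prometheus address and the hourly rate are placeholders.

```python
# Sketch: monthly GPU-hour showback per namespace from Prometheus, assuming
# kube-state-metrics naming. Address and $/GPU-hour rate are placeholders.
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # placeholder
HOURLY_RATE = 2.50                                   # assumed $/GPU-hour
QUERY = ('sum by (namespace) (avg_over_time('
         'kube_pod_container_resource_requests{resource="nvidia_com_gpu"}[30d]))')

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    namespace = series["metric"].get("namespace", "unknown")
    avg_gpus = float(series["value"][1])
    gpu_hours = avg_gpus * 24 * 30
    print(f"{namespace}: {gpu_hours:,.0f} GPU-hours (~${gpu_hours * HOURLY_RATE:,.0f})")
```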
Security considerations
Securing multi-tenant GPU clusters requires defense-in-depth:
Pod Security Standards: Enforce restricted security policies preventing privilege escalation. Disable host network access except for system components. Require non-root containers for all workloads. Implement SELinux or AppArmor for mandatory access control.
Network Policies: Isolate tenant namespaces using NetworkPolicies. Encrypt inter-node traffic with WireGuard or IPsec. Implement zero-trust networking with service mesh. Restrict external access through controlled ingress points.
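A minimal sketch of the namespace-isolation piece: default-deny ingress for a tenant namespace, allowing traffic only from pods in the same namespace. The namespace name is a placeholder; pair this with an egress policy and mesh mTLS for defense in depth.

```python
import yaml

# Sketch: default-deny ingress for a tenant namespace, permitting only
# same-namespace traffic. Namespace name is a placeholder.
tenant_isolation = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "tenant-isolation", "namespace": "team-nlp"},
    "spec": {
        "podSelector": {},                      # every pod in the namespace
        "policyTypes": ["Ingress"],
        "ingress": [{"from": [{"podSelector": {}}]}],  # same-namespace only
    },
}
print(yaml.safe_dump(tenant_isolation, sort_keys=False))
```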
GPU Isolation: Use Kata Containers or gVisor for strong isolation. Enable MIG for hardware-level GPU partitioning. Implement GPU device filtering preventing unauthorized access. Monitor GPU usage for cryptocurrency mining.
Supply Chain Security: Scan container images for vulnerabilities before deployment. Verify image signatures using Sigstore. Implement admission controllers blocking unauthorized images. Track software bill of materials for compliance.
Quick decision framework
Scheduling Strategy Selection:
| If Your Workload Is... | Use | Configuration |
|---|---|---|
| Distributed training (multi-node) | Gang scheduling | Volcano or YuniKorn PodGroups |
| Single-GPU inference | Default scheduler | Standard resource requests |
| Multi-GPU single-node | Topology-aware | Node affinity + GPU affinity |
| Mixed training + inference | Priority classes | High priority for training |
| Development/notebooks | Time-slicing | NVIDIA GPU Operator sharing |
Cluster Scaling Approach:
| GPU Count | Architecture | Key Components |
|---|---|---|
| <100 | Single cluster | Basic GPU Operator setup |
| 100-500 | Single cluster, enhanced | Kueue, DCGM monitoring |
| 500-2000 | Single cluster, optimized | Custom scheduler, federation prep |
| 2000-5000 | Multi-cluster | Hierarchical with management cluster |
| >5000 | Federated clusters | Admiralty/Liqo, GitOps sync |
MIG vs Time-Slicing Decision:
| Requirement | Choose | Trade-off |
|---|---|---|
| Hardware isolation | MIG | Fewer partitions; MIG-capable GPUs only (A30/A100/H100 class) |
| Maximum flexibility | Time-slicing | No memory protection |
| Multi-tenant production | MIG | Security boundaries |
| Development workloads | Time-slicing | Simple setup |
| Legacy GPU support | Time-slicing | Works on V100 and earlier |
Key takeaways
For platform engineers:
- Vanilla Kubernetes collapses at ~5,000 nodes—plan hierarchical architecture early
- NVIDIA GPU Operator automates driver installation, device plugin deployment, and health monitoring
- Gang scheduling prevents deadlock—essential for distributed training
- DRA (now GA) enables fine-grained GPU allocation and reassignment without node draining

For cluster administrators:
- Topology-aware scheduling prevents 3-8x performance degradation from NVLink separation
- DCGM provides 100+ GPU metrics—target >90% SM utilization
- Fair share scheduling with quotas prevents team monopolization
- Spot instances save 50-60% for fault-tolerant training with checkpointing

For security teams:
- MIG provides hardware-level isolation for multi-tenant deployments
- Kata Containers or gVisor enable strong isolation beyond MIG
- Network policies isolate tenant namespaces
- Supply chain security: scan images, verify signatures, track SBOM
Organizations mastering Kubernetes for GPU orchestration achieve transformational improvements in resource utilization, operational efficiency, and deployment velocity. The complexity of managing thousands of GPUs demands sophisticated scheduling strategies, comprehensive monitoring, and careful attention to topology and performance. Success requires deep expertise in both Kubernetes and GPU architecture, but the rewards—35% better utilization, 60% faster deployments, and dramatically reduced operational overhead—justify the investment. Companies building competitive advantage in AI must master GPU orchestration at scale, and Kubernetes provides the only production-proven platform for managing GPU resources as a programmable cloud-native infrastructure.
References
1. OpenAI. "Scaling Kubernetes to 25,000 GPU Nodes." OpenAI Engineering Blog, 2024. https://openai.com/research/scaling-kubernetes
2. Cloud Native Computing Foundation. "Kubernetes GPU Orchestration Survey Results." CNCF, 2024. https://www.cncf.io/wp-content/uploads/2024/gpu-survey.pdf
3. CNCF. "Cloud Native Survey 2024." Cloud Native Computing Foundation, 2024. https://www.cncf.io/reports/cloud-native-survey-2024/
4. NVIDIA. "Kubernetes GPU Device Plugin Documentation." NVIDIA GitHub, 2024. https://github.com/NVIDIA/k8s-device-plugin
5. NVIDIA. "Multi-Instance GPU User Guide." NVIDIA Documentation, 2024. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
6. Volcano. "Gang Scheduling for AI Workloads." Volcano Documentation, 2024. https://volcano.sh/en/docs/gang-scheduling/
7. NVIDIA. "NVLink and NVSwitch Architecture." NVIDIA Technical Documentation, 2024. https://www.nvidia.com/en-us/data-center/nvlink/
8. Kubernetes. "Scalability Considerations and Limits." Kubernetes Documentation, 2024. https://kubernetes.io/docs/setup/best-practices/cluster-large/
9. NVIDIA. "Data Center GPU Manager Documentation." NVIDIA DCGM, 2024. https://docs.nvidia.com/datacenter/dcgm/latest/
10. Introl. "Kubernetes GPU Orchestration Services." Introl Corporation, 2024. https://introl.com/coverage-area
11. NVIDIA. "GPU Time-Slicing in Kubernetes." NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/gpu-time-slicing-kubernetes/
12. Kubernetes. "Dynamic Resource Allocation KEP." Kubernetes Enhancement Proposals, 2024. https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation
13. Kubernetes. "Job Suspension Feature." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
14. Ray. "KubeRay Operator for ML Workloads." Ray Documentation, 2024. https://docs.ray.io/en/latest/cluster/kubernetes/
15. Kubeflow. "GPU Support in Kubeflow." Kubeflow Documentation, 2024. https://www.kubeflow.org/docs/components/training/gpu/
16. Red Hat. "OpenShift AI GPU Management." Red Hat Documentation, 2024. https://docs.openshift.com/container-platform/latest/architecture/gpu-support.html
17. Run:ai. "Kubernetes GPU Virtualization Platform." Run:ai Documentation, 2024. https://docs.run.ai/
18. Apache YuniKorn. "Gang Scheduling Implementation." Apache YuniKorn, 2024. https://yunikorn.apache.org/docs/user_guide/gang_scheduling/
19. Argo. "Argo Workflows for ML Pipelines." Argo Project, 2024. https://argoproj.github.io/argo-workflows/
20. Seldon. "MLOps with Kubernetes." Seldon Documentation, 2024. https://docs.seldon.io/
21. BentoML. "Deploying ML Models on Kubernetes." BentoML Documentation, 2024. https://docs.bentoml.org/en/latest/guides/kubernetes.html
22. Kueue. "Job Queueing for Kubernetes." Kubernetes SIGs, 2024. https://kueue.sigs.k8s.io/
23. Flyte. "GPU Orchestration with Flyte." Union.ai, 2024. https://docs.flyte.org/en/latest/deployment/configuration/gpu.html
24. MLflow. "Kubernetes Backend for MLflow." MLflow Documentation, 2024. https://www.mlflow.org/docs/latest/projects.html#kubernetes-execution
25. Pachyderm. "GPU Support in Pachyderm." Pachyderm Documentation, 2024. https://docs.pachyderm.com/latest/deploy-manage/manage/gpus/
Squarespace Excerpt (155 characters)
OpenAI orchestrates 25,000 GPUs on Kubernetes with 97% utilization. Master GPU scheduling, topology awareness, and scaling beyond 5,000 nodes. Complete guide.
SEO Title (58 characters)
Kubernetes GPU Orchestration: Managing 1000+ GPU Clusters
SEO Description (155 characters)
Deploy and manage multi-thousand GPU clusters on Kubernetes. Gang scheduling, MIG support, topology-aware placement. Achieve 35% better GPU utilization.
Title Review
Current title "Kubernetes for GPU Orchestration: Managing Multi-Thousand GPU Clusters" effectively captures search intent at 71 characters. Well-optimized length.
URL Slug Recommendations
Primary: kubernetes-gpu-orchestration-thousand-cluster-management
Alternatives:
1. k8s-gpu-orchestration-scale-guide-2025
2. kubernetes-nvidia-gpu-cluster-management
3. gpu-orchestration-kubernetes-best-practices