Kubernetes for GPU Orchestration: Managing Multi-Thousand GPU Clusters

Deploy and manage multi-thousand GPU clusters on Kubernetes: gang scheduling, MIG support, topology-aware placement, and production patterns.

Madison Kersh

Apr 25, 2026 10 min read

Updated December 8, 2025

December 2025 Update: Dynamic Resource Allocation (DRA) is now GA (its core API graduated to stable in Kubernetes 1.34, after alpha and beta iterations in earlier releases), enabling fine-grained GPU partitioning and time-slicing. NVIDIA GPU Operator 24.6+ adds Blackwell support and improved MIG management. Kueue (Kubernetes-native job queueing) is approaching production maturity for AI workloads. Run:ai and CoreWeave are demonstrating 50,000+ GPU clusters on Kubernetes. Multi-cluster federation tools (Liqo, Admiralty) are enabling cross-cloud GPU orchestration. vGPU support is improving for multi-tenant inference deployments.

OpenAI orchestrates 25,000 GPUs across multiple Kubernetes clusters to train its GPT models, using custom operators that automatically handle GPU failures, rebalance workloads in real time, and maintain 97% utilization despite hardware failures that occur every 2.5 hours on average.¹ The company's infrastructure team discovered that vanilla Kubernetes collapses at around 5,000 nodes without extensive modifications, which forced them to implement hierarchical cluster federation, custom scheduling algorithms, and GPU-aware autoscaling that treats every $30,000 H100 as a precious resource requiring individual tracking. Managing GPUs at scale differs fundamentally from CPU orchestration: a failed GPU during distributed training can waste millions of dollars of compute time, while poor scheduling that separates NVLink-connected GPUs causes up to 8x performance degradation. Organizations that successfully orchestrate thousands of GPUs on Kubernetes report 35% better utilization, 60% faster deployment times, and a 90% reduction in operational overhead compared with bare-metal management.²

Kubernetes dominates container orchestration with an 88% market share, but GPU support arrived late and remains challenging at scale.³ The NVIDIA GPU Operator, launched in 2019, finally brought enterprise-grade GPU management to Kubernetes, enabling features such as dynamic driver installation, automatic device plugin deployment, and GPU health monitoring. Organizations running AI workloads on Kubernetes must navigate device plugin configurations, node affinity rules, topology-aware scheduling, and resource quotas that prevent single teams from monopolizing GPU resources. Yet those who master Kubernetes for GPU orchestration gain the ability to treat thousands of GPUs as a single programmable resource pool, achieving utilization rates and operational efficiency that are out of reach for traditional HPC schedulers.

GPU device plugin architecture

The Kubernetes device plugin framework enables GPU discovery, allocation, and health monitoring across clusters:

The NVIDIA GPU Device Plugin serves as the primary interface between Kubernetes and NVIDIA GPUs.⁴ The plugin runs as a DaemonSet on every GPU node and registers GPUs with the kubelet as schedulable resources. During initialization, the plugin queries the NVIDIA Management Library (NVML) to discover available GPUs, their memory capacity, compute capability, and interconnect topology. The plugin advertises GPUs to the Kubernetes scheduler under the nvidia.com/gpu resource name, which lets pods request GPUs through standard resource specifications.
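As a rough illustration of what requesting GPUs through a standard resource specification looks like, the sketch below builds a Pod manifest as a Python dictionary. The helper name `gpu_pod_manifest`, the pod name, and the image are hypothetical; only the `nvidia.com/gpu` resource name comes from the text above.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that requests whole GPUs via nvidia.com/gpu."""
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # hypothetical image reference
                    # Extended resources are declared under limits;
                    # Kubernetes defaults the request to the same value.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("train-worker-0", "example.com/trainer:latest", 4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, since Kubernetes accepts JSON manifests as well as YAML.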

Device Plugin Registration Flow:
1. The plugin starts and discovers local GPUs through NVML
2. It registers with the kubelet over a Unix socket under /var/lib/kubelet/device-plugins/
3. It advertises the available GPUs with unique device IDs
4. It implements the Allocate() RPC for assigning GPUs to containers
5. It monitors GPU health and reports failures to the kubelet
6. It handles GPU cleanup during pod termination
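The flow above can be sketched as a small in-memory model. This is a toy illustration only: the real plugin talks to the kubelet over gRPC, and the class `ToyDevicePlugin`, its method names, and the device IDs here are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevicePlugin:
    """In-memory model of the flow; not the real gRPC device plugin API."""

    resource_name: str = "nvidia.com/gpu"
    health: dict = field(default_factory=dict)    # device ID -> healthy?
    assigned: dict = field(default_factory=dict)  # device ID -> pod name

    def discover(self, device_ids):
        # Steps 1-3: discover devices and advertise them by unique ID.
        for dev in device_ids:
            self.health[dev] = True

    def allocatable(self):
        # Healthy devices that are not currently assigned to any pod.
        return sorted(
            d for d, ok in self.health.items() if ok and d not in self.assigned
        )

    def allocate(self, pod, count):
        # Step 4: hand out `count` healthy, unassigned devices or fail.
        free = self.allocatable()
        if count > len(free):
            raise RuntimeError(f"{pod}: requested {count}, only {len(free)} free")
        chosen = free[:count]
        for dev in chosen:
            self.assigned[dev] = pod
        return chosen

    def report_unhealthy(self, dev):
        # Step 5: a failed device is no longer offered for new allocations.
        self.health[dev] = False

    def release(self, pod):
        # Step 6: free every device held by the terminated pod.
        for dev in [d for d, p in self.assigned.items() if p == pod]:
            del self.assigned[dev]


plugin = ToyDevicePlugin()
plugin.discover(["GPU-0", "GPU-1", "GPU-2", "GPU-3"])
print(plugin.allocate("trainer", 2))  # ['GPU-0', 'GPU-1']
plugin.report_unhealthy("GPU-2")
print(plugin.allocatable())           # ['GPU-3']
plugin.release("trainer")
print(plugin.allocatable())           # ['GPU-0', 'GPU-1', 'GPU-3']
```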

Multi-Instance GPU (MIG) Support enables partitioning A100 and H100 GPUs into isolated instances.⁵ Each MIG instance appears to Kubernetes as a separate GPU, allowing fine-grained resource allocation. The device plugin manages MIG profiles, handling the creation, deletion, and assignment of instances. Organizations achieve up to 7x better GPU utilization by partitioning underutilized GPUs for smaller workloads. MIG configuration requires careful planning because partitioning modes cannot be changed without draining nodes.

Alternative Device Plugins provide vendor diversity:
- AMD Device Plugin supports ROCm-enabled GPUs such as the MI250X
- Intel Device Plugin manages Intel GPUs and Gaudi accelerators
- Xilinx FPGA Device Plugin orchestrates FPGA resources
- Google TPU Device Plugin for GKE clusters
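The "up to 7x" figure follows from the smallest MIG profile: an A100 40GB can be split into at most seven 1g.5gb instances. The sketch below works through that arithmetic. The instance counts are the A100 40GB limits; the function names are hypothetical, and the `nvidia.com/mig-<profile>` resource naming assumes the device plugin's mixed MIG strategy.

```python
import math

# Maximum instances per physical GPU for A100 40GB MIG profiles.
A100_40GB_MAX_INSTANCES = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "4g.20gb": 1,
    "7g.40gb": 1,
}


def gpus_needed(jobs: int, profile: str) -> int:
    """Physical GPUs required to host `jobs` single-instance workloads."""
    return math.ceil(jobs / A100_40GB_MAX_INSTANCES[profile])


def mig_resource_name(profile: str) -> str:
    """Resource name a pod would request (mixed MIG strategy assumed)."""
    return f"nvidia.com/mig-{profile}"


small_jobs = 70
print(gpus_needed(small_jobs, "7g.40gb"))  # one whole GPU per job: 70
print(gpus_needed(small_jobs, "1g.5gb"))   # seven jobs per GPU: 10
print(mig_resource_name("1g.5gb"))         # nvidia.com/mig-1g.5gb
```

The 7x gain only materializes when each workload genuinely fits in a 1g.5gb slice; larger profiles give proportionally smaller gains.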

Scheduling strategies for GPU workloads

Effective GPU scheduling maximizes utilization while maintaining performance:

Gang Scheduling ensures that distributed training jobs receive all requested GPUs simultaneously. Without gang scheduling, partial resource allocation causes deadlock: jobs wait forever for remaining GPUs that never become available. Kubernetes schedulers such as Volcano and Apache YuniKorn implement gang scheduling through pod-group abstractions (PodGroups in Volcano).⁶ Jobs specify minimum GPU requirements, and the scheduler either allocates all resources or queues the entire job. Gang scheduling reduces cluster utilization by 10-15% but prevents training job starvation.

Topology-Aware Scheduling optimizes GPU placement based on hardware interconnects. GPUs connected through NVLink achieve 600GB/s of bandwidth versus 32GB/s over PCIe.⁷ The scheduler examines node topology to place related pods on GPUs with fast interconnects. NVIDIA GPU Feature Discovery labels nodes with topology information, enabling affinity rules. Poor topology decisions cause 3-8x performance degradation for communication-heavy workloads. Topology awareness becomes critical beyond 8 GPUs per job.

Bin Packing vs Spreading: Bin packing consolidates workloads onto fewer nodes, improving cache locality and reducing network traffic. Spreading distributes work across nodes for better fault tolerance and thermal management. The optimal strategy depends on workload characteristics: training jobs benefit from bin packing, while inference favors spreading. Dynamic strategies adjust based on cluster utilization and workload mix.

Priority and Preemption: Production workloads receive higher priority than development jobs. The scheduler preempts lower-priority pods when resources are scarce. Careful priority configuration prevents research jobs from blocking production inference. Checkpointing on preemption ensures that training progress is not lost. Priority classes range from system-critical (1000000) to development (100).

Fair Sharing and Quotas: Resource quotas prevent single teams from monopolizing GPUs. Hierarchical quotas enable organization-wide limits with team-specific sub-quotas. Fair share scheduling ensures equitable resource distribution over time. Users who consume fewer resources accumulate credits for future burst capacity. Queue systems such as Kueue provide job queueing with sophisticated admission control.
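The all-or-nothing rule can be sketched in a few lines. This is a minimal model, not how Volcano or YuniKorn are implemented; the `Job` class, the `admit_gangs` function, and the job names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    min_gpus: int  # gang size: the job gets all of these or none


def admit_gangs(jobs, free_gpus):
    """Admit jobs in order, all-or-nothing; return (admitted, queued, free)."""
    admitted, queued = [], []
    for job in jobs:
        if job.min_gpus <= free_gpus:
            free_gpus -= job.min_gpus
            admitted.append(job.name)
        else:
            # A queued gang holds no GPUs, so it cannot deadlock others.
            queued.append(job.name)
    return admitted, queued, free_gpus


jobs = [Job("llm-pretrain", 8), Job("vision-sweep", 16), Job("finetune", 4)]
print(admit_gangs(jobs, free_gpus=16))
# (['llm-pretrain', 'finetune'], ['vision-sweep'], 4)
```

The four GPUs left idle while vision-sweep waits illustrate where the utilization cost of gang scheduling comes from. Note that this sketch lets later, smaller jobs run ahead of a queued gang; a real scheduler would add a policy to keep large jobs from starving.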
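To see why the interconnect matters, a bandwidth-only estimate of a ring all-reduce is useful: each GPU sends roughly 2(N-1)/N times the gradient payload. The sketch below plugs in the 600GB/s and 32GB/s figures from the text. The 10 GB payload and the function name are assumptions, and the model ignores latency and compute overlap, which is one reason end-to-end slowdowns are smaller than the raw bandwidth ratio.

```python
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate: each GPU sends 2*(N-1)/N of the payload."""
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gb_per_s


payload_gb = 10.0  # assumed gradient size per step
nvlink = ring_allreduce_seconds(payload_gb, 8, 600.0)
pcie = ring_allreduce_seconds(payload_gb, 8, 32.0)
print(f"NVLink: {nvlink * 1000:.1f} ms per all-reduce")
print(f"PCIe:   {pcie * 1000:.1f} ms per all-reduce")
print(f"Communication slowdown: {pcie / nvlink:.2f}x")
```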
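The two strategies differ only in how candidate nodes are ranked. A minimal sketch, assuming a single-node request and a hypothetical `pick_node` helper:

```python
def pick_node(free_gpus_by_node: dict, request: int, strategy: str):
    """Choose a node for a request of `request` GPUs, or None if none fits."""
    fits = {n: free for n, free in free_gpus_by_node.items() if free >= request}
    if not fits:
        return None
    if strategy == "binpack":
        # Tightest fit first: fill partly used nodes before touching empty ones.
        return min(fits, key=lambda n: (fits[n], n))
    if strategy == "spread":
        # Emptiest node first: distribute load and limit failure impact.
        return min(fits, key=lambda n: (-fits[n], n))
    raise ValueError(f"unknown strategy: {strategy}")


free = {"node-a": 8, "node-b": 3, "node-c": 2}
print(pick_node(free, 2, "binpack"))  # node-c
print(pick_node(free, 2, "spread"))   # node-a
```

Bin packing keeps node-a's full set of eight GPUs intact for a later large training job, which is the practical reason training clusters tend to prefer it.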
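Priority tiers like these map onto Kubernetes PriorityClass objects. The sketch below generates three manifests as dictionaries. The class names and the intermediate value of 10000 are assumptions; only the 1000000 and 100 endpoints come from the text.

```python
import json


def priority_class(name: str, value: int, description: str, preempting: bool = True) -> dict:
    """Build a scheduling.k8s.io/v1 PriorityClass manifest."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        # "Never" means pods of this class wait instead of evicting others.
        "preemptionPolicy": "PreemptLowerPriority" if preempting else "Never",
        "description": description,
    }


classes = [
    priority_class("gpu-system-critical", 1000000, "Cluster-critical GPU services"),
    priority_class("gpu-production", 10000, "Production inference (assumed tier)"),
    priority_class("gpu-development", 100, "Development jobs", preempting=False),
]
for manifest in classes:
    print(json.dumps(manifest))
```

Pods opt in by setting `priorityClassName` in their spec; marking the development tier as non-preempting is a design choice that keeps experiments from evicting anything.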
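A hierarchical quota amounts to two checks per request: the team's own sub-quota and the organization-wide limit. A minimal sketch, with hypothetical team names and numbers:

```python
def can_admit(team: str, request: int, usage: dict, team_quota: dict, org_limit: int) -> bool:
    """Admit only if both the team sub-quota and the org-wide limit hold."""
    team_ok = usage.get(team, 0) + request <= team_quota[team]
    org_ok = sum(usage.values()) + request <= org_limit
    return team_ok and org_ok


usage = {"research": 60, "prod": 30}       # GPUs currently in use
team_quota = {"research": 64, "prod": 64}  # per-team sub-quotas
org_limit = 100                            # organization-wide cap

print(can_admit("research", 4, usage, team_quota, org_limit))  # True
print(can_admit("research", 8, usage, team_quota, org_limit))  # False: team quota
print(can_admit("prod", 16, usage, team_quota, org_limit))     # False: org limit
```

The sub-quotas deliberately sum to more than the organization limit, which lets either team burst when the other is idle without exceeding the overall cap.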

Scaling considerations

Scaling Kubernetes to thousands of GPUs requires architectural modifications:

Cluster Size Limitations: A single Kubernetes cluster supports a maximum of 5,000 nodes before etcd performance degrades.⁸ API server load increases quadratically with node count because of watch mechanisms. Controller manager reconciliation loops slow down beyond 1,000 nodes. Network policies become unwieldy at scale. Most organizations limit clusters to 500-1,000 GPU nodes for operational stability.

Multi-Cluster Federation: Large deployments use multiple Kubernetes clusters managed through federation. Admiralty or Virtual Kubelet enable cross-cluster scheduling. GitOps tools such as Flux or ArgoCD synchronize configurations across clusters. Service mesh technologies provide cross-cluster networking. Federation adds complexity but enables horizontal scaling beyond single-cluster limits.

Hierarchical Resource Management: Organize clusters hierarchically, with management clusters controlling workload clusters. Management clusters run control plane components and scheduling logic. Workload clusters host GPU pods without complex controllers. A hierarchical design reduces the blast radius of failures. Clear separation of concerns simplifies troubleshooting.

Custom Resource Definitions (CRDs) for AI workloads:
- TrainingJob: Defines distributed training specifications
- InferenceService: Manages model serving deployments
- GPUPool: Represents logical GPU groupings
- Checkpoint: Handles training state persistence
- ModelVersion: Tracks model iterations and lineage

Performance optimizations for scale:
- Disable unused admission webhooks to reduce API latency
- Implement pod topology spread constraints for even distribution
- Use local SSDs for container images to avoid network bottlenecks
- Enable the CPU manager for guaranteed CPU allocation
- Configure huge pages for large model memory requirements
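These limits translate directly into a cluster count. The sketch below applies the 500-1,000 node guideline to a 25,000-GPU fleet, assuming 8 GPUs per node (an assumption typical of HGX-class servers, not a figure from the text). The function name is hypothetical.

```python
import math


def clusters_needed(total_gpus: int, gpus_per_node: int, max_nodes_per_cluster: int):
    """Return (node count, cluster count) for a fleet under a per-cluster cap."""
    nodes = math.ceil(total_gpus / gpus_per_node)
    return nodes, math.ceil(nodes / max_nodes_per_cluster)


for cap in (1000, 500):
    nodes, clusters = clusters_needed(25_000, gpus_per_node=8, max_nodes_per_cluster=cap)
    print(f"cap={cap}: {nodes} nodes -> {clusters} clusters")
# cap=1000: 3125 nodes -> 4 clusters
# cap=500: 3125 nodes -> 7 clusters
```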

Monitoring and observability

Comprehensive monitoring prevents million-dollar GPU idle time:

NVIDIA Data Center GPU Manager (DCGM) provides GPU-specific metrics that are unavailable through standard Kubernetes monitoring.⁹ DCGM exports 100+ metrics, including SM utilization, memory bandwidth, temperature, power draw, and ECC errors. Prometheus integration enables long-term metric storage and alerting. Grafana dashboards visualize GPU performance across the entire fleet. Custom alerts detect underutilized GPUs, thermal throttling, and hardware degradation before failures occur.

Key GPU Metrics for Kubernetes monitoring:
- GPU Utilization: Percentage of SMs active (target >90%)
- Memory Utilization: GPU memory allocated versus available
- Power Draw: Actual versus TDP, which can indicate throttling
- Temperature: GPU and memory temperatures
- PCIe Bandwidth: Data transfer rates to and from the GPU
- NVLink Traffic: Inter-GPU communication bandwidth
- Training Metrics: Loss curves, gradient norms, learning rates
- Inference Metrics: Requests per second, P99 latency, batch sizes

Distributed Tracing tracks requests across multiple GPUs and nodes. OpenTelemetry instrumentation captures training step times, data loading latency, and checkpoint durations. Jaeger or Tempo provide distributed trace storage and analysis. Correlating traces with metrics identifies performance bottlenecks. End-to-end visibility reduces mean time to resolution by 70%.

Log Aggregation centralizes logs from thousands of GPU pods. Fluentd or Fluent Bit collect container logs with minimal overhead. Elasticsearch stores logs with automatic indexing and retention policies. Kibana enables log searching and analysis across the entire cluster. Structured logging with consistent formats simplifies troubleshooting. Alert on error patterns that indicate systemic issues.

Introl deploys and manages Kubernetes clusters for GPU orchestration across our global coverage area, with expertise in scaling to 10,000+ GPU deployments.¹⁰ Our platform engineering teams have implemented custom operators and scheduling enhancements for optimal GPU utilization.
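A few of these metrics can be turned into simple per-GPU checks. The sketch below classifies one sample of DCGM exporter fields. The 90% utilization target comes from the list above; the temperature limit, the power-cap ratio, the 700 W TDP, and the function name are assumptions chosen for illustration.

```python
def gpu_findings(sample: dict, tdp_watts: float, util_target: float = 90.0,
                 temp_limit_c: float = 85.0, power_cap_ratio: float = 0.98) -> list:
    """Flag one GPU sample against illustrative thresholds."""
    findings = []
    if sample["DCGM_FI_DEV_GPU_UTIL"] < util_target:
        findings.append("underutilized")
    if sample["DCGM_FI_DEV_GPU_TEMP"] >= temp_limit_c:
        findings.append("hot")
    # Sustained draw near TDP can mean the GPU is being power-throttled.
    if sample["DCGM_FI_DEV_POWER_USAGE"] >= power_cap_ratio * tdp_watts:
        findings.append("near_power_cap")
    return findings


idle = {"DCGM_FI_DEV_GPU_UTIL": 42, "DCGM_FI_DEV_GPU_TEMP": 61, "DCGM_FI_DEV_POWER_USAGE": 310}
stressed = {"DCGM_FI_DEV_GPU_UTIL": 97, "DCGM_FI_DEV_GPU_TEMP": 88, "DCGM_FI_DEV_POWER_USAGE": 695}
print(gpu_findings(idle, tdp_watts=700))      # ['underutilized']
print(gpu_findings(stressed, tdp_watts=700))  # ['hot', 'near_power_cap']
```

In production the same conditions would more likely live in Prometheus alerting rules evaluated over a time window rather than on single samples.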

Production deployment patterns

Training Cluster Architecture at Anthropic:
- Scale: 4,000 GPUs across 8 Kubernetes clusters
- Topology: Hierarchical federation with a central scheduler
- Storage: Distributed Lustre filesystem for training data
- Networking: RoCE v2 fabric with 200Gbps per node
- Scheduling: Custom gang scheduler with topology awareness
- Monitoring: DCGM + Prometheus with 15-second scrape intervals
- Result: 94% GPU utilization, 50% reduction in training time

Inference Platform at Uber:
- Workload: 500 million predictions daily
- Infrastructure: 2,000 T4 GPUs across 20 regions
- Orchestration: Kubernetes with Knative for serverless
- Autoscaling: Predictive scaling based on traffic patterns
- Load Balancing: Envoy proxy with least-latency routing
- Optimization: Model caching reduces cold start to 2 seconds
- Outcome: 65% cost reduction versus the previous architecture

Hybrid Training-Inference at Spotify:
- GPUs: 3,000 mixed V100/A100/T4 fleet
- Scheduling: Time-sliced sharing for development
- Isolation: Kata containers for multi-tenant security
- Cos
