Incident Response for GPU Clusters: Playbooks for Common Failure Scenarios
Updated December 8, 2025
December 2025 Update: Liquid cooling failures are now the top incident category for modern GPU clusters, spanning CDU failures, leak detection, and coolant quality issues. H100/H200 downtime costs $25-40K per GPU-day, making rapid response critical. AIOps platforms (PagerDuty, Datadog) are integrating GPU-specific runbooks. Elastic training frameworks are reducing the blast radius of GPU failures. Checkpoint frequency optimization (every 10-15 minutes) minimizes training loss from incidents.
When 500 H100 GPUs suddenly drop offline during a critical training run, every second costs $1,200 in lost compute time. When liquid cooling fails in a 2MW GPU cluster, temperatures rise 1°C every 30 seconds toward thermal shutdown. When InfiniBand fabric partitions during distributed training, 10,000 GPU-hours of computation become worthless. These scenarios demand precise, rehearsed responses that minimize damage and restore service rapidly. This guide provides battle-tested playbooks for GPU infrastructure incidents.
Incident Classification and Severity Levels
GPU infrastructure incidents require specialized severity classifications beyond traditional IT frameworks. Severity 1 (Critical) incidents involve complete cluster failure, data loss risk, or safety hazards, affecting more than 100 GPUs or causing more than $50,000 in hourly impact. These trigger immediate executive escalation, vendor engagement, and 24/7 war room activation. OpenAI's GPT-4 training experienced three Severity 1 incidents over six months, each requiring CEO involvement due to $2 million daily training costs.
Severity 2 (High) incidents impact 20-100 GPUs or cause 50% performance degradation across larger clusters. The response time target is 15 minutes, with a 2-hour resolution goal. These incidents typically involve partial cooling failures, power distribution issues, or network partition events. Meta's infrastructure automatically pages on-call engineers for Severity 2 events, with escalation to senior architects after 30 minutes without progress.
Severity 3 (Medium) incidents affect fewer than 20 GPUs or cause 25% performance degradation. These include individual node failures, driver issues, or localized network problems. Resolution targets extend to 4 hours with next-business-day follow-up acceptable. Automated systems handle 70% of Severity 3 incidents without human intervention through self-healing mechanisms.
Severity 4 (Low) incidents involve single GPU failures or minor performance variations under 10%. These enter standard ticketing workflows with 24-hour resolution targets. Anthropic's infrastructure automatically quarantines affected resources, allowing production workloads to continue while repairs proceed during maintenance windows.
Financial impact calculations drive severity assignments. Each H100 GPU represents a $30,000 capital investment with a $50 hourly operational cost. Training interruptions may invalidate days of computation worth millions. Lambda Labs calculates incident cost as: (GPUs affected × hourly rate × expected duration) + (checkpoint recovery time × cluster cost) + (SLA penalties). This formula triggered a Severity 1 classification for a 50-GPU failure due to $500,000 in checkpoint recovery costs.
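As a rough illustration, the sketch below implements that cost formula in Python; the example figures and the Severity 1 dollar threshold are assumptions for demonstration, not Lambda Labs' published values.

```python
def incident_cost(gpus_affected: int,
                  hourly_rate: float,
                  expected_duration_hours: float,
                  checkpoint_recovery_hours: float,
                  cluster_hourly_cost: float,
                  sla_penalties: float = 0.0) -> float:
    """Estimate incident cost using the formula described above."""
    lost_compute = gpus_affected * hourly_rate * expected_duration_hours
    recovery_cost = checkpoint_recovery_hours * cluster_hourly_cost
    return lost_compute + recovery_cost + sla_penalties


# Example: 50 failed GPUs at $50/GPU-hour for a 4-hour outage, plus
# 10 hours of checkpoint recovery on a $50,000/hour cluster.
cost = incident_cost(50, 50.0, 4.0, 10.0, 50_000.0)
SEV1_COST_THRESHOLD = 50_000.0  # assumed per-incident dollar threshold
severity = 1 if cost >= SEV1_COST_THRESHOLD else 2
print(f"Estimated incident cost: ${cost:,.0f} (Severity {severity})")
```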
Power Failure Response Procedures
Complete power loss scenarios require immediate load shedding to prevent cascade failures during recovery. UPS systems supporting GPU clusters typically provide 5-7 minutes of runtime at full load. The first 30 seconds determine incident trajectory: automatic transfer switches must engage, generators must start, and cooling systems must maintain operation. Microsoft's playbook initiates automatic workload suspension within 10 seconds of power event detection.
Phase 1 (0-30 seconds) focuses on state preservation. Distributed training jobs must checkpoint immediately, requiring pre-configured checkpoint locations with sufficient bandwidth. The kubectl exec command triggers emergency checkpointing across Kubernetes pods. Storage systems switch to write-through mode, ensuring data persistence. Network equipment on separate UPS systems maintains connectivity for remote management.
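A minimal sketch of the Phase 1 checkpoint fan-out, assuming training pods carry a hypothetical `role=trainer` label and that each container watches for a flag file (here `/tmp/checkpoint_now`) as its emergency checkpoint trigger; substitute whatever hook your training framework actually exposes.

```python
import subprocess

NAMESPACE = "training"                      # assumed namespace
LABEL_SELECTOR = "role=trainer"             # assumed pod label
# Hypothetical in-container hook: the training loop watches for this flag file
# and writes a checkpoint as soon as it appears.
CHECKPOINT_HOOK = ["/bin/sh", "-c", "touch /tmp/checkpoint_now"]

def list_training_pods() -> list[str]:
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE,
         "-l", LABEL_SELECTOR, "-o", "name"],
        capture_output=True, text=True, check=True)
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

def trigger_emergency_checkpoint() -> None:
    """Phase 1 (0-30 s): ask every training pod to checkpoint immediately."""
    for pod in list_training_pods():
        # Fire-and-forget so slow pods do not block the incident timeline.
        subprocess.Popen(["kubectl", "exec", "-n", NAMESPACE, pod, "--",
                          *CHECKPOINT_HOOK])

if __name__ == "__main__":
    trigger_emergency_checkpoint()
```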
Phase 2 (30 seconds to 2 minutes) involves load prioritization. Non-critical workloads terminate automatically based on pod priority classes. Inference workloads continue serving with degraded capacity. Training jobs save state and shut down gracefully. Cooling systems reduce to minimum viable operation, maintaining temperatures below thermal limits. Power management systems shed 40% of load, extending UPS runtime to 15 minutes.
Phase 3 (2-5 minutes) requires generator synchronization. Automatic transfer switches sync generator output with UPS systems before transferring load. Failed generator starts trigger immediate escalation with manual start procedures. Fuel system status verification ensures 24-hour runtime capacity. Google's data centers maintain 48-hour fuel supplies with automatic refueling contracts activated during extended outages.
Recovery procedures begin once stable power returns. Phased restoration prevents simultaneous inrush current overwhelming power systems. Storage systems initialize first, followed by network infrastructure, then compute nodes in 10% increments. GPU power limits temporarily reduce to 80% during stabilization. Full capacity returns after 30 minutes of stable operation. CoreWeave's recovery automation restores 1,000 GPUs to production in 45 minutes following power restoration.
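The loop below sketches this phased restoration, under the assumptions that node names can be enumerated from inventory, that SSH access is available, and that GPUs accept power caps via `nvidia-smi -pl`; the batch size, stabilization wait, and 560 W cap (roughly 80% of a 700 W H100 limit) mirror the procedure but are illustrative values.

```python
import subprocess
import time

NODES = [f"gpu-node-{i:03d}" for i in range(1, 101)]  # assumed inventory
BATCH_FRACTION = 0.10         # restore compute nodes in 10% increments
STABILIZATION_SECONDS = 300   # assumed wait between batches
REDUCED_POWER_LIMIT_W = 560   # roughly 80% of a 700 W H100 limit

def ssh(node: str, command: str) -> None:
    subprocess.run(["ssh", node, command], check=True)

def restore_compute_nodes() -> None:
    """Re-admit nodes in batches with reduced GPU power during stabilization."""
    batch_size = max(1, int(len(NODES) * BATCH_FRACTION))
    for start in range(0, len(NODES), batch_size):
        for node in NODES[start:start + batch_size]:
            # Out-of-band power-on and storage/network bring-up are assumed
            # to have completed; here we cap GPU power and re-admit the node.
            ssh(node, f"sudo nvidia-smi -pl {REDUCED_POWER_LIMIT_W}")
            subprocess.run(["kubectl", "uncordon", node], check=True)
        time.sleep(STABILIZATION_SECONDS)
    # After 30 minutes of stable operation, restore the normal power limits.

if __name__ == "__main__":
    restore_compute_nodes()
```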
Cooling System Failure Responses
Liquid cooling failures escalate rapidly with GPU temperatures rising 20°C per minute without active cooling. Immediate response triggers automatic frequency throttling, reducing heat generation by 40%. The nvidia-smi -pl 400 command cuts H100 power from 700W to 400W, buying critical response time. Workload migration to unaffected zones begins automatically while repair crews mobilize.
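A per-node sketch of that emergency throttle, assuming `nvidia-smi` is available on the host; the 85°C trigger is an illustrative threshold, not an NVIDIA-published limit.

```python
import subprocess

EMERGENCY_POWER_LIMIT_W = 400   # per the response procedure above
TEMP_TRIGGER_C = 85             # assumed trigger threshold

def gpu_temperatures() -> list[int]:
    """Read the current temperature of every GPU on this node."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return [int(t) for t in out.stdout.split()]

def emergency_throttle() -> None:
    """Cut power limits on every local GPU when cooling is degraded."""
    if max(gpu_temperatures()) >= TEMP_TRIGGER_C:
        subprocess.run(
            ["nvidia-smi", "-pl", str(EMERGENCY_POWER_LIMIT_W)], check=True)

if __name__ == "__main__":
    emergency_throttle()
```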
Primary loop failures require isolation of affected sections while maintaining flow to operational areas. Bypass valves redirect flow around failed components. Redundant pumps activate, maintaining 60% flow capacity. CDU (Coolant Distribution Unit) failures trigger automatic switchover to backup units within 30 seconds. Supermicro's RSD (Rack Scale Design) systems include automated valve controls isolating failures to individual racks.
Secondary loop failures between CDUs and cooling towers impact entire facilities. Emergency chillers activate within 2 minutes, providing temporary heat rejection. Data center personnel manually open emergency venting, exhausting hot air directly outside despite efficiency losses. Portable cooling units deploy to critical areas within 30 minutes. Facebook's Prineville facility maintains 2MW of portable cooling capacity for emergency response.
Leak detection triggers immediate isolation protocols. Water sensors beneath GPU racks activate solenoid valves, stopping flow within 500 milliseconds. Affected racks power down automatically while maintaining network connectivity for remote diagnosis. Recovery teams deploy absorbent materials and portable dehumidifiers preventing corrosion. Microsoft's submarine data centers use dielectric cooling fluids, eliminating water damage risk entirely.
Air cooling augmentation supports liquid-cooled systems during partial failures. CRAC (Computer Room Air Conditioning) units increase output by 50%, compensating for reduced liquid cooling capacity. Hot aisle containment systems activate, improving cooling efficiency by 20%. Temporary fans deploy in critical areas, providing spot cooling for overheating racks. These measures maintain operations during the 4-6 hours required for liquid cooling repairs.
Network Partition and Connectivity Loss
InfiniBand fabric partitions destroy distributed training efficiency instantly. Automatic detection triggers within 100 milliseconds using subnet manager heartbeats. Affected nodes quarantine automatically, preventing partial updates from corrupting model state. Job schedulers receive topology updates, rescheduling work to healthy partitions. NCCL error handling terminates affected collective operations cleanly.
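For node-local detection, a simple complement to subnet manager heartbeats is to poll port state through sysfs, which Linux HCA drivers expose under `/sys/class/infiniband`; the sketch below assumes that path layout and should be adapted to the fabric in use.

```python
import glob
from pathlib import Path

def infiniband_port_states() -> dict[str, str]:
    """Return {device/port: state string} for every local IB port."""
    states = {}
    for state_file in glob.glob("/sys/class/infiniband/*/ports/*/state"):
        parts = Path(state_file).parts
        device, port = parts[-4], parts[-2]
        states[f"{device}/port{port}"] = Path(state_file).read_text().strip()
    return states

def degraded_ports() -> list[str]:
    """Ports not reporting ACTIVE are quarantine candidates."""
    return [name for name, state in infiniband_port_states().items()
            if "ACTIVE" not in state]

if __name__ == "__main__":
    bad = degraded_ports()
    if bad:
        print("Quarantine candidates:", ", ".join(bad))
```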
Recovery requires systematic fabric reconstruction. The opensm subnet manager rebuilds routing tables, discovering surviving paths. Partial fabric operation continues at reduced bandwidth while repairs proceed. Link width degradation from 4x to 2x maintains connectivity with 50% bandwidth reduction. Amazon's EFA (Elastic Fabric Adapter) infrastructure automatically routes around failures, maintaining 85% aggregate bandwidth during single-switch failures.
Ethernet network failures impact both training and inference workloads differently. BGP (Border Gateway Protocol) reconvergence completes within 30 seconds for redundant paths. ECMP (Equal-Cost Multi-Path) routing distributes traffic across surviving links. Storage traffic prioritization ensures checkpoint operations complete despite reduced bandwidth. Quality of Service policies guarantee 40% bandwidth for critical operations.
Complete network isolation triggers autonomous operation mode. Nodes continue local computation while buffering results. Distributed training jobs pause at synchronization barriers, preserving state. Local NVMe storage buffers up to 1TB of checkpoint data awaiting connectivity restoration. Upon network recovery, buffered data synchronizes automatically, resuming operations within minutes rather than the hours a full restart would require.
DNS and service discovery failures prevent workload scheduling despite functional infrastructure. Backup DNS servers activate automatically with 15-second TTL (Time To Live) values enabling rapid updates. Kubernetes CoreDNS pods restart on unaffected nodes within 30 seconds. Static IP configurations in emergency runbooks bypass DNS for critical management access. HashiCorp Consul provides service mesh resilience with automatic failover for service discovery.
Hardware Failure Cascade Prevention
Single GPU failures can cascade through distributed training jobs affecting hundreds of devices. Immediate isolation prevents error propagation. The nvidia-smi drain command gracefully removes GPUs from resource pools. Kubernetes device plugins mark failed GPUs as unhealthy, preventing new pod scheduling. Running workloads migrate to healthy resources within 2 minutes.
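A sketch of that isolation step, assuming `kubectl` access to the cluster and a known PCI bus ID for the failed device; the `nvidia-smi drain` flag form shown is an assumption and should be verified against the installed driver version.

```python
import subprocess

def cordon_node(node_name: str) -> None:
    """Stop new pods from scheduling onto a node with a failing GPU."""
    subprocess.run(["kubectl", "cordon", node_name], check=True)

def drain_gpu(pci_bus_id: str) -> None:
    """Remove a single GPU from the local resource pool via nvidia-smi drain.

    The -p/-m flag form is assumed here; check `nvidia-smi drain --help`
    on the target driver version before relying on it in a runbook.
    """
    subprocess.run(["nvidia-smi", "drain", "-p", pci_bus_id, "-m", "1"],
                   check=True)

if __name__ == "__main__":
    cordon_node("gpu-node-042")       # example node name
    drain_gpu("00000000:3B:00.0")     # example PCI bus ID
```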
Memory errors trigger progressive responses based on severity. GPUs with single-bit errors corrected by ECC continue operating under increased monitoring frequency. Double-bit errors cause immediate workload migration and GPU quarantine. Page retirement exhaustion triggers hardware replacement scheduling. Automated ordering systems maintain 2% spare inventory for rapid replacement.
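The triage sketch below reads ECC counters through `nvidia-smi`'s `--query-gpu` interface; the field names follow the documented query properties, while the corrected-error threshold is an assumed local policy rather than an NVIDIA recommendation.

```python
import subprocess

FIELDS = ("index,"
          "ecc.errors.corrected.volatile.total,"
          "ecc.errors.uncorrected.volatile.total")
CORRECTED_WATCH_THRESHOLD = 1000   # assumed policy threshold

def ecc_triage() -> list[tuple[str, str]]:
    """Map each local GPU to a response: quarantine, watch, or no action."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    actions = []
    for line in out.stdout.strip().splitlines():
        index, corrected, uncorrected = [v.strip() for v in line.split(",")]
        if uncorrected.isdigit() and int(uncorrected) > 0:
            actions.append((index, "migrate workloads and quarantine GPU"))
        elif corrected.isdigit() and int(corrected) > CORRECTED_WATCH_THRESHOLD:
            actions.append((index, "increase monitoring frequency"))
        else:
            actions.append((index, "no action"))
    return actions

if __name__ == "__main__":
    for gpu, action in ecc_triage():
        print(f"GPU {gpu}: {action}")
```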
Power supply failures in redundant configurations continue operating at reduced capacity. N+1 configurations lose redundancy but maintain full operation. Load balancing redistributes power draw across surviving supplies. Efficiency drops 5-10% increasing heat generation. Replacement scheduling targets 4-hour response for redundancy restoration. Tesla's Dojo clusters maintain hot-spare power supplies enabling 5-minute replacements.
Motherboard component failures require careful diagnosis distinguishing repairable from terminal failures. PCIe retimers occasionally require reseating, restoring operation without replacement. VRM (Voltage Regulator Module) failures may affect single GPUs while others continue operating. BIOS recovery procedures restore corrupted firmware without hardware replacement. Dell EMC's integrated diagnostics identify component-level failures enabling targeted repairs.
Thermal cascade prevention requires aggressive intervention. Adjacent GPU temperatures rise 5-10°C when neighbors fail. Workload redistribution prevents hot spot formation. Empty rack units between failed hardware improve airflow. Portable spot coolers deploy within 15 minutes for critical areas. Temporary power limits prevent thermal runaway while maintaining partial capacity.
Communication Protocols During Incidents
Stakeholder notification follows strict escalation matrices based on impact severity and duration. Severity 1 incidents trigger immediate notification to C-level executives, major customers, and vendor support teams. Automated systems send initial alerts within 30 seconds of detection. Human-written updates follow within 5 minutes providing context and impact assessment. Uber's GPU infrastructure sends 500,000 automated alerts monthly with 1,200 requiring human communication.
War room activation brings together infrastructure, application, and vendor teams. Video bridges open automatically for Severity 1-2 incidents. Screen sharing displays real-time metrics and system status. Dedicated scribes document all actions and decisions. Parallel communication channels separate technical discussion from executive updates. Microsoft Teams or Slack channels provide persistent documentation surviving beyond incident resolution.
Customer communication balances transparency with confidence. Initial notifications acknowledge issues without speculation on causes. Hourly updates provide meaningful progress without excessive technical detail. SLA impact calculations appear in business terms rather than infrastructure metrics. Proactive communication prevents support ticket floods that distract response teams. AWS's status page updates achieve 90% customer satisfaction despite reporting numerous daily incidents.
Vendor escalation accelerates for GPU infrastructure given specialized expertise requirements. NVIDIA Enterprise Support responds to Severity 1 incidents within 15 minutes. Hardware vendors dispatch replacement parts before failure confirmation for critical customers. Cooling vendors provide remote assistance for system adjustments. Network vendors join war rooms for fabric-wide issues. These relationships require pre-negotiated support contracts with defined escalation paths.
Internal post-incident communication ensures organizational learning. Blameless postmortems document timeline, root cause, and improvement opportunities. Wide distribution ensures similar teams learn from incidents. Action items track through completion with quarterly reviews. Google's postmortem culture documents every significant incident, building institutional knowledge preventing recurrence.
Recovery and Restoration Procedures
Checkpoint recovery determines training restart success after incidents. Valid checkpoints require model weights, optimizer state, and training configuration. Distributed checkpoints across multiple nodes require careful reconstruction. Missing shard detection triggers recovery from previous checkpoint, potentially losing hours of computation. PyTorch's distributed checkpoint format enables partial recovery when some shards survive.
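A simplified recovery sketch using the `torch.distributed.checkpoint` (DCP) API from PyTorch 2.x; the checkpoint path is a placeholder, and production jobs typically use DCP's state-dict helpers rather than this bare-bones pattern.

```python
import torch.distributed.checkpoint as dcp

CHECKPOINT_DIR = "/checkpoints/run-042/step-120000"   # placeholder path

def restore_training_state(model, optimizer):
    """Reload sharded model weights and optimizer state after an incident.

    Each rank loads only the shards it owns; DCP handles resharding when the
    surviving world size differs from the one that wrote the checkpoint.
    """
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    dcp.load(state_dict=state_dict, checkpoint_id=CHECKPOINT_DIR)  # in-place
    model.load_state_dict(state_dict["model"])
    optimizer.load_state_dict(state_dict["optimizer"])

# Typical usage once the process group is re-established:
#   torch.distributed.init_process_group("nccl")
#   restore_training_state(model, optimizer)
```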
Validation procedures confirm system health before resuming production workloads. Memory tests run across all GPUs verifying ECC functionality. Network tests confirm bandwidth and latency meet requirements. Thermal tests ensure cooling capacity handles full load. Performance benchmarks establish baselines for comparison. These validations typically require 2-4 hours for 1,000-GPU clusters.
Gradual workload restoration prevents secondary incidents from overwhelming recovered systems. Development workloads resume first, providing low-risk validation. Production inference workloads follow with careful monitoring. Training jobs restart last after confirming checkpoint integrity. Load increases in 25% increments with 15-minute stabilization periods. This approach identified residual issues in 15% of recoveries at Anthropic's infrastructure.
Data integrity verification ensures incidents didn't corrupt training data or model state. Checksum validation confirms storage system consistency. Model validation tests confirm expected behavior on reference inputs. Gradient statistics identify numerical instabilities from partial updates. Training metric comparison with pre-incident baselines detects subtle corruptions. OpenAI discarded 3 days of GPT-4 training after detecting checkpoint corruption from a power incident.
Capacity management during recovery prevents resource contention. Reserved capacity policies ensure critical workloads receive necessary resources. Preemption policies suspend lower-priority work if required. Quota adjustments prevent any single user from monopolizing recovered resources. Admission control delays non-critical work until full capacity returns. These policies prevented secondary incidents in 90% of major recoveries at Meta.
Automated Response Systems
Kubernetes operators implement self-healing for common failure patterns. Custom Resource Definitions (CRDs) encode response procedures as declarative configurations. The GPU operator detects unhealthy devices and cordons nodes automatically. Failed pod detection triggers rescheduling within 30 seconds. Node auto-repair replaces consistently failing instances. Google Kubernetes Engine achieves 99.95% availability through extensive automation.
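A stripped-down sketch of such a self-healing loop using the official `kubernetes` Python client: it watches node events and cordons any node whose Ready condition degrades. A real GPU operator implements this as a controller backed by CRDs, but the control flow is similar.

```python
from kubernetes import client, config, watch

def cordon(v1: client.CoreV1Api, node_name: str) -> None:
    """Mark a node unschedulable so no new GPU pods land on it."""
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

def watch_and_cordon_unhealthy_nodes() -> None:
    config.load_incluster_config()   # or load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    w = watch.Watch()
    for event in w.stream(v1.list_node):
        node = event["object"]
        ready = next((c for c in (node.status.conditions or [])
                      if c.type == "Ready"), None)
        if ready is not None and ready.status != "True" and not node.spec.unschedulable:
            cordon(v1, node.metadata.name)
            print(f"Cordoned unhealthy node {node.metadata.name}")

if __name__ == "__main__":
    watch_and_cordon_unhealthy_nodes()
```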
Ansible playbooks codify complex response procedures enabling consistent execution. Emergency response playbooks run with break-glass credentials bypassing normal access controls. Inventory updates dynamically based on cluster state. Parallel execution handles hundreds of nodes simultaneously. Idempotent operations enable safe re-execution during partial failures. Netflix's Ansible automation handles 80% of incidents without human intervention.
Machine learning models predict and prevent incidents before impact. Time series analysis identifies abnormal patterns in temperature, power, and performance metrics. Clustering algorithms group similar failure patterns enabling targeted responses. Survival analysis predicts component lifetime enabling preventive replacement. These models achieve 75% true positive rates with 5% false positives, significantly reducing incident frequency.
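As a toy illustration of the time-series approach, the sketch below flags GPU temperature samples that deviate sharply from a rolling baseline; the window size and z-score threshold are arbitrary choices, and production models are considerably richer.

```python
import numpy as np

WINDOW = 60        # rolling history length (assumed 1-minute sample cadence)
Z_THRESHOLD = 4.0  # flag readings this many standard deviations from the mean

def anomalous_points(samples: np.ndarray) -> list[int]:
    """Return indices where a reading deviates sharply from recent history."""
    flagged = []
    for i in range(WINDOW, len(samples)):
        window = samples[i - WINDOW:i]
        mean, std = window.mean(), window.std()
        if std > 0 and abs(samples[i] - mean) / std > Z_THRESHOLD:
            flagged.append(i)
    return flagged

# Example: a steady 65 °C baseline followed by a sudden jump toward 85 °C.
temps = np.concatenate([65.0 + np.random.normal(0, 0.5, 120),
                        np.array([82.0, 84.0, 85.0])])
print("Anomalies at sample indices:", anomalous_points(temps))
```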
Chaos engineering validates response procedures through controlled failures. Randomly terminating GPUs tests workload resilience. Power failure simulations validate generator transfer procedures. Network partition injection confirms application handling. Cooling failure simulations verify thermal management. LinkedIn's chaos engineering identified response procedure gaps preventing 50 production incidents annually.
Integration with observability platforms enables intelligent responses. Prometheus metrics trigger alerts based on complex queries. Grafana dashboards visualize system state during incidents. Elasticsearch log analysis identifies correlated errors across systems. Jaeger distributed tracing reveals request flow disruptions. These tools reduced mean time to detection from 15 minutes to 30 seconds at Uber.
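A small sketch of pulling GPU telemetry from Prometheus's HTTP API (`/api/v1/query`), assuming the cluster scrapes dcgm-exporter's `DCGM_FI_DEV_GPU_TEMP` metric; the endpoint URL, label names, and alert threshold are placeholders for the local setup.

```python
import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"   # placeholder endpoint
QUERY = "max by (Hostname) (DCGM_FI_DEV_GPU_TEMP)"     # assumes dcgm-exporter
ALERT_TEMP_C = 85.0                                    # illustrative threshold

def hottest_nodes() -> list[tuple[str, float]]:
    """Return (hostname, max GPU temperature) pairs from Prometheus."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": QUERY}, timeout=5)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return [(r["metric"].get("Hostname", "unknown"), float(r["value"][1]))
            for r in results]

if __name__ == "__main__":
    for host, temp in hottest_nodes():
        if temp >= ALERT_TEMP_C:
            print(f"ALERT: {host} reporting {temp:.0f} °C")
```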
SLA Management and Reporting
Service Level Agreement structures for GPU infrastructure require careful definition given the complexity of failure modes. Availability SLAs typically target 99.9% monthly uptime, allowing 43 minutes of downtime. Performance SLAs guarantee minimum throughput rather than raw availability. Batch job SLAs focus on completion time rather than continuous availability. These nuanced agreements better reflect actual customer impact.
Credit calculations must account for partial failures affecting performance without complete outage. Linear degradation models provide proportional credits for reduced performance. Step functions trigger full credits at specific degradation thresholds. Time-weighted calculations account for incident duration and severity. Amazon EC2's credit model provides 10% credits for each 0.1% below SLA targets, motivating aggressive availability targets.
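The sketch below contrasts the two credit models described above; the tiers, rates, and 30-day month are illustrative assumptions rather than any provider's published schedule.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def achieved_availability(downtime_minutes: float) -> float:
    return 100.0 * (1.0 - downtime_minutes / MINUTES_PER_MONTH)

def linear_credit(availability: float, target: float = 99.9,
                  credit_per_tenth: float = 10.0) -> float:
    """Proportional model: a fixed credit percentage per 0.1% shortfall."""
    shortfall = max(0.0, target - availability)
    return min(100.0, credit_per_tenth * (shortfall / 0.1))

def step_credit(availability: float) -> float:
    """Step-function model: fixed credit tiers at illustrative thresholds."""
    if availability >= 99.9:
        return 0.0
    if availability >= 99.0:
        return 10.0
    if availability >= 95.0:
        return 30.0
    return 100.0

# Example: 90 minutes of downtime in a 30-day month (~99.79% availability).
avail = achieved_availability(90)
print(f"{avail:.3f}% -> linear {linear_credit(avail):.0f}% "
      f"/ step {step_credit(avail):.0f}% credit")
```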
Exclusion clauses protect providers from circumstances beyond control. Force majeure events including natural disasters exempt SLA penalties. Scheduled maintenance windows don't count against availability metrics. Customer-induced failures from misconfiguration receive no credits. Emergency security patches may override maintenance windows. These exclusions require careful documentation and customer communication.
Reporting requirements demand accurate incident tracking and impact calculation. Automated systems calculate availability metrics from monitoring data. Monthly reports detail all incidents affecting SLA metrics. Root cause analysis appears for significant events. Improvement plans demonstrate commitment to service quality. Microsoft Azure publishes detailed monthly SLA reports building customer confidence despite occasional incidents.
Continuous improvement programs tie SLA performance to infrastructure investment. Repeated SLA breaches trigger mandatory infrastructure upgrades. Engineering resources prioritize based on SLA impact. Vendor selection includes historical SLA performance. Capacity planning incorporates SLA buffers. This approach improved Google Cloud's GPU availability from 99.5% to 99.95% over two years.
Lessons from Major Incidents
The February 2024 Anthropic training interruption affecting 2,000 H100 GPUs demonstrated cascade failure risks. Initial cooling pump failure triggered automatic failover, but backup pump control software contained a bug causing oscillating flow rates. Temperature variations induced thermal cycling stress, causing three GPU failures within minutes. Emergency shutdown prevented broader damage but lost 5 days of training worth $2.5 million. Software diversity between primary and backup systems now prevents common-mode failures.
Microsoft's April 2024 InfiniBand fabric collapse showed the importance of topology awareness. Spanning tree protocol misconfiguration created a broadcast storm affecting 10,000 GPUs across three availability zones. Standard network recovery procedures failed due to RDMA protocol interactions. Manual fabric segmentation and progressive reconstruction took 18 hours. Updated procedures now include RDMA-specific recovery steps validated through quarterly drills.
The June 2024 power distribution failure at a major cloud provider revealed single points of failure in redundant systems. Maintenance error triggered arc flash in primary switchgear, but protective relay misconfiguration prevented automatic transfer to backup power. Manual intervention required 47 minutes, exceeding UPS capacity. Three hundred GPUs suffered hard shutdowns with 5% experiencing hardware damage. Arc flash detection systems and relay coordination studies now occur annually.
CoreWeave's August 2024 distributed training incident highlighted checkpoint corruption risks. Network partition during checkpoint writing created inconsistent state across model shards. Recovery attempts using corrupted checkpoints propagated errors through healthy model components. Complete retraining from last known good checkpoint required 10 days. Atomic checkpoint protocols with validation now prevent partial write corruption.
The October 2024 Tesla Dojo thermal runaway event demonstrated exotic cooling failure modes. Coolant contamination gradually reduced heat transfer efficiency over months. Compensatory increases in flow rate masked degradation until pumps reached maximum capacity. Sudden load increase triggered thermal runaway affecting 50 custom training chips. Coolant quality monitoring and predictive maintenance now identify degradation before critical thresholds.
GPU cluster incidents demand specialized response procedures beyond traditional data center playbooks. Success requires deep technical knowledge, rehearsed procedures, and extensive automation. Organizations must invest in response capabilities proportional to their GPU infrastructure value. The difference between 30-minute and 3-hour recovery times may be millions in lost computation and SLA penalties.
Evolution from reactive to proactive incident management reduces both frequency and impact of failures. Predictive analytics identify problems before customer impact. Automated responses handle routine failures without human intervention. Chaos engineering validates procedures before real incidents occur. These capabilities transform incident response from crisis management to routine operations.
The playbooks presented here derive from collective industry experience operating millions of GPUs. Each incident teaches valuable lessons incorporated into future responses. Organizations implementing these procedures achieve faster recovery times, reduced customer impact, and lower operational costs. As GPU deployments scale toward exascale computing, robust incident response becomes essential for sustainable operations.
References
OpenAI. "Incident Response Procedures for Large-Scale Training Infrastructure." OpenAI Engineering Blog, 2024.
Microsoft Azure. "GPU Cluster Incident Management: Lessons from 100,000 GPUs." Azure Reliability Documentation, 2024.
Meta. "Automated Incident Response for AI Infrastructure at Scale." Meta Engineering, 2024.
Google Cloud. "SRE Practices for GPU Infrastructure Reliability." Google SRE Workbook, 2024.
Anthropic. "Post-Mortem: February 2024 Training Infrastructure Incident." Anthropic Engineering, 2024.
NVIDIA. "Enterprise Support Escalation Procedures for Critical GPU Failures." NVIDIA Documentation, 2024.
CoreWeave. "Building Resilient GPU Infrastructure: Incident Response Automation." CoreWeave Technical Blog, 2024.
Amazon Web Services. "EFA Network Failure Recovery Procedures." AWS HPC Documentation, 2024.
Key takeaways
For operations teams:
- H100 downtime costs $25-40K per GPU-day; 500 H100s offline = $1,200/second in lost compute
- Severity 1: >100 GPUs or $50K/hour impact triggers CEO involvement, vendor engagement, 24/7 war room
- Automated systems handle 70% of Severity 3 incidents (<20 GPUs) without human intervention

For infrastructure engineers:
- Liquid cooling failures: GPU temps rise 20°C/minute without active cooling; 1°C every 30 seconds toward thermal shutdown
- Emergency power response: UPS provides 5-7 minutes at full load; first 30 seconds determine incident trajectory
- InfiniBand partition destroys distributed training instantly; NCCL error handling terminates affected collective operations

For SRE teams:
- Checkpoint frequency: 10-15 min minimizes training loss from incidents
- OpenAI discarded 3 days of GPT-4 training after detecting checkpoint corruption from a power incident
- Lambda Labs cost formula: (GPUs × hourly rate × duration) + (checkpoint recovery × cluster cost) + SLA penalties

For reliability planning:
- February 2024 Anthropic: cooling pump failure cascaded through 2,000 H100s, lost 5 days of training ($2.5M)
- Microsoft April 2024: InfiniBand spanning tree misconfiguration affected 10,000 GPUs across three AZs (18-hour recovery)
- Netflix Ansible automation handles 80% of incidents without human intervention; GKE achieves 99.95% availability