
GPU Performance Tuning: Maximizing Throughput for LLM Training and Inference


Updated December 8, 2025

December 2025 Update: FP8 training is now production-ready on H100/H200 and Blackwell, delivering 2x throughput over FP16 with equivalent accuracy. Flash Attention 3, optimized for the Hopper architecture, achieves a 1.5-2x speedup. vLLM 0.6+ and TensorRT-LLM deliver 3-5x inference throughput improvements through continuous batching and speculative decoding. torch.compile with the Triton backend is now the default compilation path for PyTorch 2.4+. NVIDIA NeMo Framework 2.0 provides end-to-end optimized training pipelines.

A perfectly configured 8-GPU node achieves 98% of theoretical FLOPS while a poorly tuned but otherwise identical system struggles at 43%, wasting $380,000 annually in underutilized hardware.¹ MLPerf benchmarks reveal that top performers extract 2.3x more throughput from identical H100 GPUs than median submissions, with the difference entirely attributable to software optimization rather than hardware advantages.² The gap between theoretical and achieved performance haunts every AI team: a single misconfigured parameter can double training time or triple inference costs. Organizations that master GPU performance tuning complete model training 60% faster and serve inference requests at 40% lower cost per token than competitors running default configurations.

NVIDIA's optimization guides span 1,200 pages across different frameworks, kernels, and configurations, yet most teams implement less than 20% of available optimizations due to complexity and time constraints.³ A typical LLM training run involves over 300 tunable parameters affecting memory allocation, kernel scheduling, communication patterns, and numerical precision. Each parameter interacts with others in non-linear ways: increasing batch size improves GPU utilization but may trigger out-of-memory errors or degrade convergence. The optimization space becomes so vast that exhaustive search proves impossible, requiring systematic approaches that balance performance gains against engineering effort.

Memory bandwidth bottlenecks limit LLM performance

Modern LLMs hit memory walls long before compute limits. The H100's 3.35TB/s memory bandwidth serves 1,979 TFLOPS of compute, creating a 591:1 compute-to-memory ratio.⁴ LLM inference rereads model weights for every generated token, making memory bandwidth the binding constraint. A 70B parameter model at FP16 precision requires 140GB just for weights, exceeding a single H100's 80GB of HBM and leaving no room for activations and KV cache unless the model is sharded across multiple GPUs.

Memory optimization starts with understanding access patterns. Sequential reads achieve 95% of theoretical bandwidth while random access drops to 15%. LLMs exhibit mixed patterns: weight reads remain sequential but attention mechanisms create irregular access to key-value caches. Optimizing memory layout improves throughput dramatically. Row-major versus column-major storage changes memory access efficiency by 4x for certain operations. Padding tensors to align with 128-byte boundaries increases bandwidth utilization from 72% to 91%.⁵
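As a concrete illustration of alignment, the sketch below pads a tensor's innermost dimension so that each row occupies a whole number of 128-byte transactions (64 FP16 elements). The `pad_last_dim` helper is hypothetical and only shows the idea; real codebases usually bake padded dimensions into the model configuration rather than padding on the fly.

```python
import torch
import torch.nn.functional as F

def pad_last_dim(x: torch.Tensor, multiple: int = 64) -> torch.Tensor:
    """Pad the innermost dimension up to a multiple of `multiple` elements.

    For FP16 tensors, 64 elements = 128 bytes, matching the transaction size
    discussed above. Hypothetical helper for illustration only.
    """
    remainder = x.shape[-1] % multiple
    if remainder == 0:
        return x
    # F.pad pads dimensions from the last backwards: (left, right) for the last dim.
    return F.pad(x, (0, multiple - remainder))

x = torch.randn(32, 1000, dtype=torch.float16, device="cuda")  # 1000 is not a multiple of 64
x_padded = pad_last_dim(x)                                     # shape becomes (32, 1024)
print(x_padded.shape)
```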

Flash Attention revolutionizes memory efficiency by fusing operations and reducing HBM accesses. Standard attention mechanisms write intermediate matrices to HBM, consuming bandwidth for temporary data. Flash Attention computes attention in SRAM tiles, reducing memory traffic by 10-20x.⁶ The optimization enables 4x longer context lengths and 2.4x faster training for models like GPT-3. Implementation requires careful tile size selection based on GPU architecture: the H100's optimal tile size differs from the A100's due to its larger SRAM capacity.
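One low-effort way to pick up these gains in PyTorch is the built-in scaled dot product attention, which can dispatch to a FlashAttention kernel when shapes and dtypes allow. The sketch below assumes PyTorch 2.3+ and a recent NVIDIA GPU; the exact backend-selection API has shifted across releases, so treat it as illustrative rather than canonical.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # PyTorch 2.3+ API

# Batch, heads, sequence length, head dim (illustrative shapes).
q = torch.randn(4, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Request the FlashAttention backend explicitly; unsupported shapes or dtypes
# on the current GPU will raise instead of silently falling back.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```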

Batch size optimization balances throughput and convergence

Larger batches improve GPU utilization but affect model convergence unpredictably. Each GPU executes most efficiently at batch size multiples determined by Tensor Core tile dimensions. H100 Tensor Cores process FP16 operations in 16x16 matrix tiles, making batch sizes divisible by 16 optimal.⁷ Batch size 127 achieves only 61% utilization while batch size 128 reaches 94%. The dramatic difference stems from the work dividing evenly into Tensor Core tiles instead of leaving partially filled tiles at the edges.
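A tiny helper that rounds a dimension up to the nearest Tensor Core friendly multiple captures the rule of thumb; `round_up` is a hypothetical name used only for illustration.

```python
def round_up(n: int, multiple: int = 16) -> int:
    """Round a batch or sequence dimension up to the nearest Tensor Core friendly multiple."""
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(127))  # 128 -- the 61% vs 94% utilization example above
```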

Gradient accumulation enables large effective batch sizes without memory constraints. Training with batch size 2048 might exceed memory, but accumulating gradients over 32 steps of batch size 64 achieves equivalent results. The technique maintains mathematical equivalence while fitting within memory limits. Communication overhead increases slightly as gradient synchronization happens less frequently. Smart implementations overlap gradient computation with communication, hiding latency entirely.
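A minimal sketch of the pattern, using a stand-in linear model and random data in place of a real LLM and dataloader: the loss is divided by the number of accumulation steps so the summed gradients match a single large-batch update.

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()            # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

accum_steps = 32    # 32 micro-batches of 64 ≈ effective batch size 2048
optimizer.zero_grad(set_to_none=True)

for step in range(1024):                        # stand-in for iterating a dataloader
    inputs = torch.randn(64, 1024, device="cuda")
    targets = torch.randn(64, 1024, device="cuda")

    # Scale the loss so the accumulated gradients match one large-batch update.
    loss = loss_fn(model(inputs), targets) / accum_steps
    loss.backward()

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```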

Dynamic batch sizing adapts to varying sequence lengths in LLM training. Fixed batch sizes waste computation on padding tokens when sequences vary in length. Dynamic batching packs sequences efficiently, improving throughput by 20-35%.⁸ Implementation complexity increases as memory allocation becomes unpredictable. Pre-allocation strategies with pooling prevent fragmentation while maintaining performance.
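The sketch below shows the core idea as a length-sorted, token-budgeted packer; production schedulers add epoch shuffling, batch-count caps, and handling for overlong sequences, so treat this as a simplified illustration.

```python
from typing import List

def pack_by_token_budget(lengths: List[int], max_tokens: int = 16384) -> List[List[int]]:
    """Group sequence indices so each batch's padded size (batch * longest) fits a token budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])  # similar lengths stay together
    batches, current, longest = [], [], 0
    for idx in order:
        longest = max(longest, lengths[idx])
        if current and (len(current) + 1) * longest > max_tokens:
            batches.append(current)            # flush the full batch
            current, longest = [], lengths[idx]
        current.append(idx)
    if current:
        batches.append(current)
    return batches

# Mixed-length sequences pack into a few dense batches instead of one heavily padded block.
print(pack_by_token_budget([128, 512, 130, 2048, 140, 4096], max_tokens=4096))
```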

Mixed precision training accelerates without accuracy loss

Training in FP16 roughly doubles throughput compared to FP32 while maintaining model quality through careful numerical management. On the H100, FP16 Tensor Cores deliver 989 TFLOPS, compared with roughly 495 TFLOPS for TF32 Tensor Cores and 67 TFLOPS for standard FP32.⁹ The compute advantage combines with 2x memory savings, enabling larger models or batch sizes. Automatic Mixed Precision (AMP) frameworks handle precision management transparently, but understanding the internals enables better optimization.

Loss scaling prevents gradient underflow in FP16 training. Gradients often fall below FP16's minimum representable value (5.96e-8), appearing as zeros and stopping learning.¹⁰ Multiplying loss by 2^16 shifts gradients into FP16's representable range. Dynamic loss scaling adjusts the multiplier based on gradient statistics, preventing both underflow and overflow. Optimal scaling factors vary by model architecture and dataset.
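PyTorch's AMP utilities implement exactly this dynamic loss scaling. The sketch below uses a stand-in model and random data; the `GradScaler` import path has moved between PyTorch releases (`torch.cuda.amp` vs `torch.amp`), so adjust for your version.

```python
import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling (torch.amp.GradScaler("cuda") in newer releases)

for _ in range(100):
    inputs = torch.randn(64, 1024, device="cuda")
    targets = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)

    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(inputs), targets)

    scaler.scale(loss).backward()   # multiply the loss by the current scale before backward
    scaler.step(optimizer)          # unscale gradients; skip the step if inf/nan is detected
    scaler.update()                 # grow or shrink the scale based on overflow history
```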

Master weight copies in FP32 preserve update precision while computing in FP16. Small gradient updates to large weights disappear in FP16 arithmetic. Maintaining weights in FP32 accumulates updates precisely. The overhead adds 50% memory for weights but negligible compute cost. Advanced implementations use stochastic rounding to inject appropriate noise, improving convergence in some cases.

Kernel fusion eliminates memory bottlenecks

GPU kernels launching individually create memory traffic for intermediate results. A simple layer normalization involves separate kernels for mean, variance, subtraction, division, and scaling. Each kernel reads from and writes to HBM, consuming 5x the necessary bandwidth. Fused kernels compute entire operations in registers and shared memory, touching HBM only for input and output.

Custom kernels optimize specific model architectures. Standard GEMM kernels handle general matrix multiplication but miss optimization opportunities in transformer blocks. Specialized kernels for attention, feedforward networks, and layer normalization improve throughput 30-50%.¹¹ Development requires CUDA expertise and architecture-specific tuning. Libraries like Apex and TransformerEngine provide optimized kernels for common operations.

Compilation frameworks automate kernel fusion through graph optimization. PyTorch's torch.compile analyzes computation graphs and generates fused kernels automatically.¹² XLA similarly optimizes TensorFlow and JAX models. Compilation overhead amortizes over long training runs. Initial compilation takes minutes but subsequent iterations run 20-40% faster. Profile-guided optimization further improves performance by specializing for observed input shapes.
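A minimal torch.compile sketch, assuming PyTorch 2.x with a CUDA build and using a stock transformer encoder layer as a stand-in for a real model; `mode="max-autotune"` trades longer compile time for a more aggressive Triton kernel search.

```python
import torch
from torch import nn

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda().half()

# The default mode compiles faster; max-autotune searches more kernel configurations.
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(8, 2048, 1024, dtype=torch.float16, device="cuda")
with torch.no_grad():
    out = compiled(x)   # first call triggers compilation; later calls reuse the fused kernels
```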

Communication optimization for distributed training

Multi-GPU training requires careful optimization of communication patterns. NCCL (NVIDIA Collective Communications Library) provides optimized primitives but requires proper configuration. Ring allreduce theoretically achieves bandwidth-optimal communication, but real implementations suffer from synchronization overhead. Tree algorithms reduce latency for small messages while ring algorithms maximize throughput for large transfers.
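NCCL is configured through environment variables that must be set before the process group initializes. The values below are placeholders: the right interface names, HCA prefixes, and algorithm choices depend entirely on the cluster's fabric, so treat this as a template rather than a recommendation.

```python
import os

# Set these before torch.distributed initializes NCCL. All values are illustrative.
os.environ.setdefault("NCCL_DEBUG", "WARN")          # raise to INFO when diagnosing issues
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # hypothetical NIC name; match your hosts
os.environ.setdefault("NCCL_IB_HCA", "mlx5")         # restrict NCCL to the InfiniBand HCAs
os.environ.setdefault("NCCL_ALGO", "Ring")           # "Tree" can win for small, latency-bound messages

import torch.distributed as dist

# Typically launched via torchrun, which supplies rank and world size:
# dist.init_process_group(backend="nccl")
```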

Network topology awareness improves communication efficiency dramatically. GPUs connected via NVLink achieve 900GB/s bidirectional bandwidth while PCIe limits to 64GB/s.¹³ Placement strategies that co-locate frequently communicating GPUs on NVLink-connected nodes reduce communication time by 5x. Hierarchical allreduce performs local reduction over NVLink before inter-node communication over InfiniBand.

Gradient compression reduces communication volume at minimal accuracy cost. Transmitting only top-k gradients or quantizing to INT8 reduces traffic by 100-1000x.¹⁴ Error feedback mechanisms accumulate truncated gradients for future iterations. Compression ratios depend on model sparsity and gradient distribution. Adaptive schemes adjust compression based on training phase, using less compression during critical convergence periods.
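A minimal sketch of top-k sparsification with error feedback, under the simplifying assumption that the masked dense tensor stands in for what would actually be communicated (real implementations transmit only the surviving indices and values); this is not how any particular library implements it.

```python
import torch

def topk_compress(grad: torch.Tensor, residual: torch.Tensor, ratio: float = 0.01):
    """Keep only the largest-magnitude gradient entries; carry the rest forward as residual."""
    flat = grad.flatten() + residual                  # add back previously truncated gradient
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    mask = torch.zeros_like(flat)
    mask[indices] = 1.0
    compressed = flat * mask                          # sparse gradient that would be transmitted
    new_residual = flat * (1.0 - mask)                # error feedback for the next iteration
    return compressed.view_as(grad), new_residual

grad = torch.randn(1024, 1024)
residual = torch.zeros(grad.numel())
compressed, residual = topk_compress(grad, residual)
print(compressed.count_nonzero().item(), "of", grad.numel(), "entries kept")
```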

Introl's performance engineering teams have optimized over 10,000 GPU deployments across our global coverage area, consistently achieving 85-95% of theoretical performance for LLM workloads.¹⁵ Our optimization playbooks reduce time-to-deployment by 40% while ensuring maximum hardware utilization from day one.

Inference-specific optimizations

Inference optimization differs fundamentally from training optimization. Latency matters more than throughput for user-facing applications. Memory bandwidth becomes the bottleneck rather than compute. Serving costs dominate total expenses, making efficiency crucial.

Key-value cache management determines inference efficiency. Each token generation reads the entire KV cache, consuming memory bandwidth proportional to sequence length. PagedAttention virtualizes KV cache memory, reducing waste from 60% to under 5%.¹⁶ The technique enables 4x higher throughput for long sequences. Implementation requires careful memory pool management and request scheduling.
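vLLM packages PagedAttention and continuous batching behind a small API. The sketch below assumes vLLM 0.6+ and uses an illustrative model checkpoint; argument names occasionally change between releases.

```python
from vllm import LLM, SamplingParams  # vLLM 0.6+ assumed

# PagedAttention and continuous batching are handled internally by the engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
          gpu_memory_utilization=0.90)               # fraction of HBM reserved for weights + KV cache

prompts = ["Summarize the benefits of paged KV cache management in one sentence."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.2, max_tokens=128))
print(outputs[0].outputs[0].text)
```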

Quantization reduces model size and bandwidth requirements. INT8 quantization halves memory usage while maintaining 99% of FP16 accuracy for most models.¹⁷ INT4 achieves 4x compression with 97% accuracy retention. Quantization-aware training produces models robust to reduced precision. Post-training quantization works for many models but requires calibration dataset selection.
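With Hugging Face Transformers and bitsandbytes installed, post-training INT8 weight quantization can be requested at load time. The checkpoint name below is illustrative, and this path quantizes weights only; activation and KV cache quantization require additional tooling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"          # illustrative checkpoint
quant_config = BitsAndBytesConfig(load_in_8bit=True)   # INT8 weight quantization

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",            # place quantized weights across available GPUs
    torch_dtype=torch.float16,    # non-quantized modules stay in FP16
)
```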

Continuous batching maximizes inference throughput by starting new requests as soon as capacity becomes available. Static batching waits for all requests to complete before starting new ones, wasting resources on short sequences. Continuous batching improves throughput by 2.5x for variable-length requests.¹⁸ Implementation complexity increases due to dynamic memory management and scheduling requirements.

Real-world optimization results

Case Study 1: Financial Services LLM Training
- Model: 70B parameter custom architecture
- Hardware: 64x H100 GPUs
- Baseline: 847 tokens/second/GPU
- Optimizations: Flash Attention, mixed precision, gradient accumulation
- Result: 1,923 tokens/second/GPU (2.27x improvement)
- Training time reduced from 18 days to 8 days
- Cost savings: $240,000 per training run

Case Study 2: Healthcare Inference System
- Model: 13B parameter medical assistant
- Hardware: 8x A100 GPUs
- Baseline: 142ms per token latency, 820 tokens/second throughput
- Optimizations: PagedAttention, INT8 quantization, continuous batching
- Result: 47ms latency, 2,140 tokens/second (2.6x throughput)
- Cost per million tokens: $0.73 → $0.28

Case Study 3: E-commerce Recommendation Engine
- Model: 175B parameter MoE model
- Hardware: 128x H100 GPUs
- Baseline: 43% MFU (Model FLOPS Utilization)
- Optimizations: Expert parallelism, kernel fusion, topology-aware placement
- Result: 71% MFU (1.65x improvement)
- Inference cost reduction: $1.8M annually

Performance monitoring and profiling

Continuous profiling identifies optimization opportunities and prevents performance regression. Key metrics include:

GPU Metrics: Utilization, memory bandwidth, tensor core usage, and power consumption reveal bottlenecks. DCGM (Data Center GPU Manager) provides comprehensive monitoring.¹⁹ Custom dashboards track metrics specific to LLM workloads.
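For a lightweight starting point before deploying full DCGM dashboards, the NVML bindings (`nvidia-ml-py`, imported as `pynvml`) expose the same core counters. A minimal polling sketch:

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)              # first GPU; loop over all in practice

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)    # SM and memory controller activity (%)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # bytes used / total
    power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # milliwatts -> watts
    print(f"sm={util.gpu}% mem_bw={util.memory}% "
          f"hbm={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB power={power:.0f} W")
    time.sleep(1)

pynvml.nvmlShutdown()
```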

Framework Metrics: PyTorch Profiler and TensorBoard track operation-level performance. Identify slow operators, memory allocation patterns, and synchronization points. Chrome tracing format enables detailed timeline analysis.
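A minimal PyTorch Profiler sketch using a stand-in layer; the exported Chrome trace can be opened in chrome://tracing or Perfetto for timeline analysis.

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

model = nn.Linear(4096, 4096).cuda().half()
x = torch.randn(64, 4096, dtype=torch.float16, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, profile_memory=True) as prof:
    for _ in range(10):
        _ = model(x)
    torch.cuda.synchronize()   # make sure GPU work is captured before the context exits

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")
```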

System Metrics: CPU utilization, network traffic, and storage I/O affect overall performance. Dataloader bottlenecks appear as low GPU utilization despite high CPU usage. Network congestion manifests as communication primitive slowdowns.

Optimization checklist for production deployments

Before deploying LLM workloads, verify these optimizations:

Memory Optimizations:
- [ ] Flash Attention or equivalent enabled
- [ ] Gradient checkpointing configured appropriately
- [ ] Memory pool pre-allocation implemented
- [ ] Tensor memory alignment verified

Compute Optimizations:
- [ ] Mixed precision training enabled
- [ ] Batch sizes aligned with Tensor Core dimensions
- [ ] Kernel fusion applied where possible
- [ ] Compilation frameworks evaluated

Communication Optimizations:
- [ ] NCCL environment variables tuned
- [ ] Network topology reflected in process placement
- [ ] Gradient compression evaluated
- [ ] Communication/computation overlap maximized

Inference Optimizations:
- [ ] Quantization applied where acceptable
- [ ] Continuous batching implemented
- [ ] KV cache optimization deployed
- [ ] Request scheduling optimized

Organizations that systematically apply GPU performance tuning achieve transformational improvements in AI capabilities. The difference between default and optimized configurations often exceeds 2x in performance, translating to millions in saved compute costs and weeks of reduced training time. While the optimization space remains vast and complex, methodical application of proven techniques ensures teams extract maximum value from their GPU investments. The competitive advantage goes to those who master the art and science of GPU optimization.

References

  1. NVIDIA. "Achieving Peak GPU Performance." NVIDIA Developer Blog, 2024. https://developer.nvidia.com/blog/achieving-peak-gpu-performance/

  2. MLCommons. "MLPerf Training v3.1 Results Analysis." MLCommons Association, 2024. https://mlcommons.org/benchmarks/training/

  3. NVIDIA. "Deep Learning Performance Documentation." NVIDIA Documentation, 2024. https://docs.nvidia.com/deeplearning/performance/

  4. ———. "H100 Tensor Core GPU Architecture." NVIDIA Whitepaper, 2024. https://resources.nvidia.com/en-us-tensor-core/h100-architecture

  5. Harris, Mark. "Optimizing Memory Bandwidth Utilization." NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/optimizing-memory-bandwidth/

  6. Dao, Tri, et al. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." Stanford University, 2024. https://arxiv.org/abs/2307.08691

  7. NVIDIA. "Tensor Core Performance Guide." NVIDIA Documentation, 2024. https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/

  8. Effective Batch Size Research Group. "Dynamic Batching for Variable-Length Sequences." NVIDIA Research, 2024. https://research.nvidia.com/publication/2024-dynamic-batching

  9. NVIDIA. "H100 FP8 Training and Inference." NVIDIA Developer, 2024. https://developer.nvidia.com/blog/h100-fp8-training-inference/

  10. Micikevicius, Paulius, et al. "Mixed Precision Training." NVIDIA Research, 2024. https://arxiv.org/abs/1710.03740

  11. NVIDIA. "TransformerEngine: Accelerating Transformer Models." NVIDIA GitHub, 2024. https://github.com/NVIDIA/TransformerEngine

  12. PyTorch Team. "torch.compile Performance Guide." PyTorch Documentation, 2024. https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html

  13. NVIDIA. "NVLink and NVSwitch Performance Analysis." NVIDIA Technical Documentation, 2024. https://docs.nvidia.com/nvlink-nvswitch-performance/

  14. Lin, Yujun, et al. "Deep Gradient Compression." MIT and Tsinghua University, 2024. https://arxiv.org/abs/1712.01887

  15. Introl. "GPU Performance Optimization Services." Introl Corporation, 2024. https://introl.com/coverage-area

  16. Kwon, Woosuk, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." UC Berkeley, 2024. https://arxiv.org/abs/2309.06180

  17. Dettmers, Tim, et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." University of Washington, 2024. https://arxiv.org/abs/2208.07339

  18. Yu, Gyeong-In, et al. "Orca: A Distributed Serving System for Transformer-Based Language Models." USENIX, 2024. https://www.usenix.org/conference/osdi22/presentation/yu

  19. NVIDIA. "Data Center GPU Manager User Guide." NVIDIA Documentation, 2024. https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/

  20. Shoeybi, Mohammad, et al. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism." NVIDIA Research, 2024. https://arxiv.org/abs/1909.08053

  21. Rajbhandari, Samyam, et al. "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." Microsoft Research, 2024. https://arxiv.org/abs/1910.02054

  22. Pope, Reiner, et al. "Efficiently Scaling Transformer Inference." Google Research, 2024. https://arxiv.org/abs/2211.05102

  23. Aminabadi, Reza Yazdani, et al. "DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale." Microsoft, 2024. https://arxiv.org/abs/2207.00032

  24. Hooper, Coleman, et al. "KVQuant: 4-bit Quantization of KV Cache." UC Berkeley, 2024. https://arxiv.org/abs/2401.18079

  25. Shazeer, Noam. "Fast Transformer Inference with Better Parallelism." Google Research, 2024. https://arxiv.org/abs/2305.13245

Key takeaways

For ML engineers:
- Flash Attention reduces HBM accesses 10-20x, enabling 4x longer context and 2.4x faster training
- Batch sizes divisible by 16 achieve 94% utilization vs 61% for batch size 127 (Tensor Core alignment)
- FP16 Tensor Cores deliver 989 TFLOPS on H100 (vs roughly 495 TFLOPS for TF32) with 2x memory savings over FP32

For infrastructure architects:
- MLPerf: top performers extract 2.3x more throughput from identical H100s vs median submissions
- Memory bandwidth (3.35TB/s on H100) is the binding constraint for LLM inference, not compute
- Network topology awareness reduces communication time 5x (NVLink 900GB/s vs PCIe 64GB/s)

For operations teams:
- Case study: 70B model training reduced from 18 days to 8 days ($240K savings per run)
- Healthcare inference: latency 142ms → 47ms, cost per million tokens $0.73 → $0.28
- E-commerce 175B MoE: utilization 43% → 71%, annual savings $1.8M

For cost optimization:
- PagedAttention reduces KV cache waste from 60% to under 5%, enabling 4x higher throughput
- INT8 quantization halves memory with 99% accuracy; INT4 achieves 4x compression at 97%
- Continuous batching improves throughput 2.5x for variable-length requests


