TensorRT-LLM Optimization: Mastering NVIDIA's Inference Stack
Updated December 11, 2025
December 2025 Update: TensorRT-LLM achieving 10,000+ output tokens/sec on H100 with FP8, sub-100ms TTFT. Production deployments reporting 4x throughput vs native PyTorch. Kernel fusion combining LayerNorm, matmuls, activations into single CUDA kernels. Inflight batching maximizing GPU utilization. FP8 attention on Hopper/Blackwell delivering additional speedups.
NVIDIA's TensorRT-LLM delivers raw inference performance that alternatives struggle to match. On H100 GPUs with FP8 precision, the framework achieves over 10,000 output tokens per second at peak throughput with time-to-first-token latencies below 100 milliseconds.¹ Production deployments report up to 4x throughput improvements over native PyTorch inference. The performance comes at a cost: TensorRT-LLM requires more configuration expertise and longer optimization cycles than user-friendly alternatives like vLLM.
For organizations committed to NVIDIA hardware and willing to invest engineering time in optimization, TensorRT-LLM extracts maximum performance from expensive GPU infrastructure. Understanding the framework's architecture, quantization options, and tuning parameters enables teams to build inference systems that justify premium hardware investments through superior token economics.
Architecture and core optimizations
TensorRT-LLM builds on NVIDIA's TensorRT inference optimizer, extending the compilation framework with transformer-specific optimizations. The library provides Python APIs for model definition alongside C++ runtime components for production deployment.
Kernel fusion: TensorRT-LLM combines multiple transformer operations into single optimized CUDA kernels. LayerNorm, matrix multiplications, bias additions, and activation functions execute together rather than requiring separate kernel launches and memory transfers. Fusion reduces kernel launch overhead and eliminates intermediate tensor materialization.²
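The benefit is easiest to see with a toy example. The NumPy sketch below is purely illustrative (it is not TensorRT-LLM code, and Python itself performs no kernel fusion): the unfused version mirrors separate kernel launches that each materialize an intermediate tensor, while the fused version expresses the same math as a single pass, which is what a fused CUDA kernel achieves on the GPU.
import numpy as np

def unfused_block(x, w, b):
    # Mirrors separate kernels: each step writes an intermediate tensor
    # that the next step reads back from memory.
    normed = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
    projected = normed @ w              # matmul
    biased = projected + b              # bias add
    return np.maximum(biased, 0.0)      # activation (ReLU for brevity)

def fused_block(x, w, b):
    # Same math expressed as one pass; a fused CUDA kernel avoids the
    # intermediate writes and the per-step launch overhead.
    normed = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
    return np.maximum(normed @ w + b, 0.0)

x = np.random.randn(4, 1024).astype(np.float32)
w = np.random.randn(1024, 4096).astype(np.float32)
b = np.zeros(4096, dtype=np.float32)
assert np.allclose(unfused_block(x, w, b), fused_block(x, w, b))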
Custom attention kernels: Hand-optimized implementations of multi-head and grouped-query attention leverage Tensor Core instructions for maximum throughput. Flash Attention variants reduce memory bandwidth requirements while maintaining numerical precision. FP8 attention kernels on Hopper and Blackwell GPUs provide additional speedups.
Inflight batching: Traditional static batching forces all requests in a batch to wait for the longest sequence to complete. Inflight batching adds new requests to running batches at each generation step, processing context and generation phases together.³ The approach maximizes GPU utilization by keeping compute units busy even as individual requests complete.
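The scheduling idea can be sketched in a few lines of Python. This is a conceptual toy, not TensorRT-LLM's batch manager; the hypothetical step_fn stands in for one forward pass over the current batch and reports which requests finished.
from collections import deque

def inflight_batching_loop(requests, step_fn, max_batch_size):
    # Toy scheduler: admit queued requests into the running batch at every
    # step and retire finished ones immediately, so the batch never drains
    # while it waits for its longest sequence (as static batching would).
    pending = deque(requests)
    running = []
    while pending or running:
        while pending and len(running) < max_batch_size:
            running.append(pending.popleft())
        finished = step_fn(running)          # one context or decode step
        running = [r for r in running if r not in finished]
A production scheduler also checks KV cache capacity before admitting a request, which is where the paged allocator described next comes in.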
Paged KV caching: Inspired by operating system virtual memory, paged attention allocates KV cache in non-contiguous blocks rather than requiring continuous memory regions.⁴ Block-level allocation enables sharing KV cache among requests with common prefixes and achieves near-zero memory waste from internal fragmentation.
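A minimal allocator sketch shows the block-table idea. The 16-token block size, class name, and prefix-sharing handling are assumptions for illustration, not the library's implementation; a real allocator also reference-counts shared blocks.
BLOCK_SIZE = 16  # tokens per KV cache block (assumed for illustration)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request_id -> list of physical block ids

    def allocate(self, request_id, num_tokens, shared_prefix_blocks=()):
        # Sequences with a common prefix can point at the same physical
        # blocks; only the remainder needs fresh allocation.
        table = list(shared_prefix_blocks)
        remaining = num_tokens - len(table) * BLOCK_SIZE
        blocks_needed = max(0, -(-remaining // BLOCK_SIZE))   # ceil division
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted")
        table += [self.free_blocks.pop() for _ in range(blocks_needed)]
        self.block_tables[request_id] = table
        return table

    def release(self, request_id):
        # Blocks return to the free pool (reference counting for shared
        # prefix blocks is omitted in this sketch).
        self.free_blocks.extend(self.block_tables.pop(request_id))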
Performance comparison: TensorRT-LLM vs vLLM
Both frameworks target production LLM inference, but architectural differences create distinct performance profiles:
| Metric | TensorRT-LLM | vLLM |
|---|---|---|
| Peak throughput (Llama 70B, A100) | ~700 tokens/sec | ~600-650 tokens/sec |
| Time-to-first-token | 35-50ms | 50-80ms |
| Short sequence throughput advantage | 1.34x | Baseline |
| Long sequence TPOT advantage | 2.72x | Baseline |
| Setup complexity | High (weeks) | Low (hours) |
When both frameworks run their default configurations, TensorRT-LLM consistently outperforms vLLM, delivering 1.34x higher throughput on short sequences and 2.72x better time-per-output-token on long sequences.⁵ On B200 GPUs, TensorRT-LLM's deeper optimization for the Blackwell architecture widens the gap further.
vLLM offers advantages in developer experience:⁶
- OpenAI-compatible API for drop-in replacement
- Simpler deployment without a compilation step
- Automatic model optimization with sensible defaults
- Broader hardware support beyond NVIDIA GPUs
Recommendation: Deploy TensorRT-LLM when maximizing hardware efficiency justifies engineering investment. Choose vLLM for faster time-to-production or when operating at smaller scale where absolute performance matters less than development velocity.
Quantization strategies
TensorRT-LLM supports extensive quantization options for trading precision against performance and memory efficiency. Choosing the right quantization method depends on batch size, accuracy requirements, and target hardware.
FP8 quantization (recommended first)
FP8 provides the best balance of performance improvement with minimal accuracy degradation:⁷
python quantize.py \
--model_dir $MODEL_PATH \
--qformat fp8 \
--kv_cache_dtype fp8 \
--output_dir $OUTPUT_PATH
FP8 quantization requires calibration to determine appropriate scaling factors. The calibration process runs inference on representative samples to measure activation ranges:
from tensorrt_llm.quantization import QuantConfig, CalibConfig
quant_config = QuantConfig(
quant_algo="fp8",
kv_cache_quant_algo="fp8"
)
calib_config = CalibConfig(
calib_dataset="cnn_dailymail",
calib_batch_size=8,
calib_num_samples=512
)
FP8 delivers medium performance improvement with very low accuracy impact and requires only minutes for calibration. Hopper and Blackwell GPUs provide hardware FP8 support; Ada GPUs support FP8 with reduced efficiency.
INT4 AWQ for memory-constrained deployments
When memory limits model size, INT4 Activation-aware Weight Quantization compresses weights to 4 bits while maintaining acceptable accuracy:⁸
python quantize.py \
--model_dir $MODEL_PATH \
--qformat int4_awq \
--awq_block_size 64 \
--tp_size 4 \
--output_dir $OUTPUT_PATH
INT4 AWQ excels in small-batch scenarios (batch size ≤ 4) where inference becomes memory-bound. Weight loading time dominates computation, so aggressive weight compression provides substantial speedups. For large batches, INT4 AWQ's performance advantage diminishes as computation density increases.
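The small-batch advantage follows from arithmetic intensity, and a back-of-the-envelope estimate makes it concrete. The sketch below assumes rough H100-class peak numbers (about 3.35 TB/s of HBM bandwidth and on the order of 1,000 dense TFLOPS); at batch 1 the time to stream the weights dominates, so halving weight bytes roughly halves step latency, while at batch 256 compute catches up and the advantage fades.
def decode_step_estimate(params_b, bytes_per_weight, batch_size,
                         mem_bw_tbs=3.35, compute_tflops=1000.0):
    # Crude lower bounds for one decode step of a dense model, using
    # assumed peak hardware numbers; real kernels reach a fraction of both.
    weight_bytes = params_b * 1e9 * bytes_per_weight
    flops = 2 * params_b * 1e9 * batch_size          # ~2 FLOPs per weight per token
    t_memory = weight_bytes / (mem_bw_tbs * 1e12)    # time to stream the weights once
    t_compute = flops / (compute_tflops * 1e12)      # time to do the math
    return t_memory, t_compute

print(decode_step_estimate(70, 1.0, 1))    # FP8, batch 1: ~21 ms memory vs ~0.14 ms compute
print(decode_step_estimate(70, 0.5, 1))    # INT4, batch 1: ~10.5 ms memory (about 2x faster)
print(decode_step_estimate(70, 1.0, 256))  # batch 256: ~21 ms memory vs ~36 ms compute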
INT8 SmoothQuant for balanced optimization
SmoothQuant migrates quantization difficulty from activations to weights, enabling effective INT8 quantization without significant accuracy loss:
python quantize.py \
--model_dir $MODEL_PATH \
--qformat int8_sq \
--kv_cache_dtype int8 \
--output_dir $OUTPUT_PATH
INT8 SmoothQuant provides medium performance improvement with medium accuracy impact. Organizations should try FP8 first, falling back to INT8 SQ if FP8 results don't meet requirements.
Quantization selection framework
NVIDIA recommends the following priority order:⁹
- FP8 - Best performance/accuracy tradeoff, requires Hopper/Blackwell
- INT8 SmoothQuant - Good alternative for Ada GPUs or when FP8 accuracy insufficient
- INT4 AWQ/GPTQ - Maximum compression for memory-constrained scenarios
For KV cache specifically, FP8 quantization is recommended over INT8 on Hopper and Ada GPUs due to lower accuracy impact in most cases.
Production deployment configuration
Optimal TensorRT-LLM deployment requires tuning multiple parameters based on workload characteristics:
Engine build configuration
trtllm-build \
--checkpoint_dir $CHECKPOINT_PATH \
--output_dir $ENGINE_PATH \
--max_batch_size 256 \
--max_num_tokens 8192 \
--max_input_len 4096 \
--max_seq_len 8192 \
--gemm_plugin auto \
--use_paged_context_fmha enable \
--workers 8
max_batch_size: Default 256 in recent versions. Production deployments achieving maximum throughput often increase to 2048, fully leveraging inflight batching capabilities.¹⁰
max_num_tokens: Controls total tokens processed per batch iteration. Default 8192 balances throughput with memory consumption. Reduce for memory-constrained deployments; increase cautiously with monitoring.
use_paged_context_fmha: Enables paged attention for efficient KV cache management. Required when using inflight batching. The implementation pre-allocates KV cache memory, requiring approximately 60% more VRAM than model weights alone.¹¹
Triton Inference Server integration
Production deployments typically use NVIDIA Triton Inference Server with the TensorRT-LLM backend:
model_repository/
└── llama-70b/
├── 1/
│ └── model.py
├── config.pbtxt
└── tensorrt_llm/
└── 1/
├── config.json
└── engine/
Triton provides multi-model orchestration, request queuing, metrics collection, and Kubernetes-native scaling. The pre-built NGC container includes TensorRT-LLM backend with inflight batching and paged KV cache support enabled.
Memory planning
Estimate memory requirements before deployment:
Total VRAM = Model Weights + KV Cache + Activation Memory + Runtime Overhead
Model Weights (FP8): Parameters × 1 byte
Model Weights (INT4): Parameters × 0.5 bytes
KV Cache: batch_size × seq_len × num_layers × 2 × hidden_dim × precision_bytes
A 70B parameter model in FP8 requires approximately:
- Weights: 70GB
- KV Cache (batch 256, seq 8192): ~120GB
- Activations + overhead: ~30GB
- Total: ~220GB (3x H100 80GB or 2x H200 141GB)
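A small helper makes the arithmetic repeatable for other configurations. It is a sketch of the simplified formula above; for models using grouped-query attention, substitute num_kv_heads × head_dim for hidden_dim (which shrinks the KV cache substantially), and treat the overhead term as a rough placeholder.
def estimate_vram_gb(params_b, weight_bytes, batch_size, seq_len,
                     num_layers, kv_dim, kv_bytes, overhead_gb=30.0):
    # kv_dim is hidden_dim for classic multi-head attention, or
    # num_kv_heads * head_dim for grouped-query attention.
    weights = params_b * weight_bytes                          # GB
    kv_cache = (batch_size * seq_len * num_layers * 2 *
                kv_dim * kv_bytes) / 1e9                       # GB
    return weights + kv_cache + overhead_gb

# Hypothetical 70B-class GQA model (80 layers, 8 KV heads x 128 dims),
# FP8 weights and FP8 KV cache, batch 64, 8K sequences: ~186 GB
print(estimate_vram_gb(params_b=70, weight_bytes=1.0, batch_size=64,
                       seq_len=8192, num_layers=80, kv_dim=1024, kv_bytes=1.0))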
Performance tuning workflow
Systematic optimization extracts maximum performance from TensorRT-LLM deployments:
Phase 1: Baseline measurement
Use trtllm-bench for quick performance assessment:
python -m tensorrt_llm.bench \
--model_dir $ENGINE_PATH \
--input_len 512 \
--output_len 256 \
--batch_size 32 \
--num_requests 1000
The benchmarking utility sets optimal engine parameters automatically, providing baseline performance without full Triton deployment complexity.¹²
Phase 2: Quantization selection
Test FP8 first against accuracy requirements. If accuracy degrades beyond acceptable thresholds, evaluate INT8 SQ or INT4 AWQ. Run evaluation benchmarks on representative tasks, not just perplexity measurements.
Phase 3: Batch size optimization
Profile throughput across batch sizes from 1 to max_batch_size. Identify the knee of the throughput curve where additional batching provides diminishing returns. Set max_batch_size 20-30% above this point to accommodate traffic spikes.
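One way to find the knee programmatically is to sweep batch sizes and stop where a doubling of the batch yields less than a chosen marginal gain. The run_benchmark callable below is a hypothetical placeholder for whatever harness you use (trtllm-bench, a load generator) that returns tokens per second for a given batch size.
def find_throughput_knee(run_benchmark,
                         batch_sizes=(1, 2, 4, 8, 16, 32, 64, 128, 256),
                         min_gain=0.10):
    # Returns the batch size after which doubling the batch improves
    # throughput by less than min_gain (10% by default), plus the curve.
    results = [(bs, run_benchmark(bs)) for bs in batch_sizes]
    for (bs_prev, tps_prev), (bs_next, tps_next) in zip(results, results[1:]):
        if (tps_next - tps_prev) / tps_prev < min_gain:
            return bs_prev, results
    return results[-1][0], results

# knee, curve = find_throughput_knee(my_benchmark_fn)
# Set max_batch_size roughly 20-30% above the knee to absorb traffic spikes.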
Phase 4: KV cache tuning
Monitor KV cache utilization during production workloads. If utilization consistently exceeds 80%, increase max_num_tokens or reduce max_batch_size. If utilization stays below 50%, reduce allocation to free memory for larger batches.
Phase 5: Continuous monitoring
Track key metrics in production:
- Tokens per second (throughput)
- Time-to-first-token (latency)
- Queue depth (capacity)
- KV cache utilization (memory)
- GPU utilization (efficiency)
Advanced optimizations
Speculative decoding
TensorRT-LLM supports speculative decoding using smaller draft models to predict multiple tokens verified by the main model. The technique provides 1.5-2x speedup for compatible workloads:
# Enable speculative decoding in engine build
trtllm-build \
--speculative_decoding_mode draft_tokens_external \
--max_draft_len 5 \
...
Speculative decoding benefits latency-sensitive applications where time-to-completion matters more than throughput. The optimization requires maintaining both draft and target models in memory.
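A simple analytical model helps set expectations before committing memory to a draft model. Assume the draft proposes k tokens per step, the target accepts each with probability a, and a draft forward pass costs a fraction c of a target pass; the expected accepted tokens per verification step then follow a geometric series. The numbers below are illustrative assumptions, not measurements.
def speculative_speedup(k, acceptance, draft_cost_ratio):
    # Expected speedup over plain autoregressive decoding under a simple
    # model: expected accepted tokens per verification step divided by the
    # relative cost of that step (k draft passes plus one target pass).
    expected_tokens = (1 - acceptance ** (k + 1)) / (1 - acceptance)
    cost_per_step = k * draft_cost_ratio + 1.0
    return expected_tokens / cost_per_step

# 5 draft tokens, 65% per-token acceptance, draft at 10% of target cost:
print(round(speculative_speedup(5, 0.65, 0.10), 2))   # ~1.76x, within the 1.5-2x range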
Multi-GPU configurations
TensorRT-LLM supports tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism (EP) for distributed inference:
# 4-way tensor parallelism
trtllm-build \
--tp_size 4 \
--pp_size 1 \
...
TP splits each layer across GPUs, requiring all-reduce operations at each layer boundary. PP splits layers across GPUs in pipeline stages. For inference, TP typically provides better latency while PP enables larger model deployment. MoE models benefit from EP for distributing experts across GPUs.
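The latency difference stems largely from communication volume, which can be estimated before choosing a layout. The sketch below assumes Megatron-style tensor parallelism with two all-reduces per transformer layer over the hidden-state activation and ring all-reduce traffic of roughly 2(p-1)/p per element; real traffic depends on kernels, overlap, and topology.
def tp_allreduce_bytes_per_token(num_layers, hidden_dim, dtype_bytes, tp_size):
    # Approximate per-GPU bytes communicated per decoded token, assuming
    # two all-reduces per layer on a [1, hidden_dim] activation.
    elements = 2 * num_layers * hidden_dim
    ring_factor = 2 * (tp_size - 1) / tp_size
    return elements * dtype_bytes * ring_factor

# Hypothetical 80-layer, 8192-hidden model in FP16 across 4 GPUs: ~3.9 MB/token
print(tp_allreduce_bytes_per_token(80, 8192, 2, 4) / 1e6, "MB per token per GPU")
At megabytes of traffic per generated token, this is why tensor parallelism is usually confined to a fast NVLink domain, while pipeline parallelism tolerates slower inter-node links.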
Blackwell optimizations
Recent TensorRT-LLM releases include Blackwell-specific optimizations:¹³
- DeepSeek-R1 performance optimization for B200
- NVFP4 (4-bit) quantization support
- Enhanced MXFP8 with microscaling
- PDL support for FP8 GEMM plugins
Organizations deploying on Blackwell GPUs should update to latest TensorRT-LLM versions to access hardware-specific optimizations unavailable on older releases.
When to choose TensorRT-LLM
The framework delivers maximum value under specific conditions:
Strong fit:
- Committed to NVIDIA GPU infrastructure long-term
- Engineering team with CUDA/TensorRT experience
- Workloads where 20-40% performance improvement justifies weeks of optimization
- Production scale where hardware savings exceed engineering costs
- Latency-critical applications requiring sub-50ms TTFT
Weak fit:
- Rapid prototyping or experimental deployments
- Teams lacking NVIDIA optimization expertise
- Multi-cloud requirements spanning GPU vendors
- Small-scale deployments where vLLM performance suffices
- Frequent model changes requiring re-optimization
Organizations deploying TensorRT-LLM at scale can leverage Introl's GPU infrastructure expertise for hardware planning and deployment across global data center locations.
The optimization investment
TensorRT-LLM represents NVIDIA's answer to the inference efficiency challenge. The framework extracts performance that justifies H100 and B200 premium pricing through deep hardware integration unavailable in portable alternatives. Adobe achieved 60% reduction in diffusion latency using Model Optimizer with TensorRT-LLM, demonstrating production impact at enterprise scale.¹⁴
The tradeoff remains clear: TensorRT-LLM demands engineering investment measured in weeks rather than days. Organizations running inference workloads exceeding $50,000 monthly should evaluate TensorRT-LLM seriously, as even modest percentage improvements yield substantial dollar savings. Smaller deployments may find vLLM or similar frameworks provide adequate performance with dramatically lower integration costs.
The December 2025 releases adding DeepSeek support, NVFP4 quantization, and Blackwell optimizations demonstrate NVIDIA's continued investment in the framework. For teams building inference infrastructure on NVIDIA hardware, TensorRT-LLM mastery becomes a competitive advantage that compounds over time through accumulated optimization expertise and production-hardened deployment configurations.
References
1. NVIDIA. "LLM Inference Benchmarking: Performance Tuning with TensorRT-LLM." NVIDIA Technical Blog. July 2025. https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm/
2. Unite.AI. "TensorRT-LLM: A Comprehensive Guide to Optimizing Large Language Model Inference for Maximum Performance." 2025. https://www.unite.ai/tensorrt-llm-a-comprehensive-guide-to-optimizing-large-language-model-inference-for-maximum-performance/
3. NVIDIA. "NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching." NVIDIA Technical Blog. 2025. https://developer.nvidia.com/blog/nvidia-tensorrt-llm-now-accelerates-encoder-decoder-models-with-in-flight-batching/
4. SoftwareMill. "Boosting LLMs Performance in Production." 2025. https://softwaremill.com/boosting-llms-performance-in-production/
5. SqueezeBits. "[vLLM vs TensorRT-LLM] #1. An Overall Evaluation." 2025. https://blog.squeezebits.com/vllm-vs-tensorrtllm-1-an-overall-evaluation-30703
6. MarkTechPost. "vLLM vs TensorRT-LLM vs HF TGI vs LMDeploy: A Deep Technical Comparison for Production LLM Inference." November 19, 2025. https://www.marktechpost.com/2025/11/19/vllm-vs-tensorrt-llm-vs-hf-tgi-vs-lmdeploy-a-deep-technical-comparison-for-production-llm-inference/
7. NVIDIA. "Optimizing LLMs for Performance and Accuracy with Post-Training Quantization." NVIDIA Technical Blog. August 2025. https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/
8. NVIDIA. "FP8 Quantization." TensorRT-LLM Documentation. Accessed December 8, 2025. https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/fp8-quantization.html
9. NVIDIA. "Best Practices to Choose the Right Quantization Methods." TensorRT Model Optimizer Documentation. 2025. https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_choosing_quant_methods.html
10. BentoML. "Best Practices for Tuning TensorRT-LLM for Optimal Serving with BentoML." 2025. https://www.bentoml.com/blog/tuning-tensor-rt-llm-for-optimal-serving-with-bentoml
11. GitHub. "Guides or Tips for Optimization for KV Cache Usage with Inflight Batcher." TensorRT-LLM Issues #1012. https://github.com/NVIDIA/TensorRT-LLM/issues/1012
12. NVIDIA. "Welcome to TensorRT LLM's Documentation." Accessed December 8, 2025. https://nvidia.github.io/TensorRT-LLM/
13. NVIDIA. "Release Notes." TensorRT-LLM Documentation. December 2025. https://nvidia.github.io/TensorRT-LLM/release-notes.html
14. GitHub. "NVIDIA/TensorRT-Model-Optimizer." Accessed December 8, 2025. https://github.com/NVIDIA/TensorRT-Model-Optimizer
Key takeaways
For infrastructure architects:
- 10,000+ output tokens/sec on H100 with FP8; 4x throughput over native PyTorch; <100ms time-to-first-token
- TensorRT-LLM vs vLLM: 1.34x higher throughput (short sequences), 2.72x better TPOT (long sequences); setup takes weeks vs hours
- 70B model in FP8 requires ~220GB: weights 70GB + KV cache ~120GB + overhead ~30GB, which needs 3x H100 80GB or 2x H200 141GB
For ML engineers:
- Quantization priority: FP8 (best tradeoff, Hopper/Blackwell) → INT8 SmoothQuant (Ada fallback) → INT4 AWQ (memory-constrained)
- FP8 KV cache recommended over INT8 on Hopper/Ada due to lower accuracy impact in most cases
- INT4 AWQ excels in small-batch (≤4) scenarios where inference is memory-bound; diminishing returns at large batches
For operations teams:
- Inflight batching: adds new requests to running batches at each generation step, maximizing GPU utilization
- Paged KV caching: block-level allocation enables prefix sharing and near-zero fragmentation; requires ~60% more VRAM
- Key production metrics: tokens/sec (throughput), TTFT (latency), queue depth, KV cache utilization, GPU utilization
For deployment teams:
- Triton Inference Server with TensorRT-LLM backend provides multi-model orchestration, request queuing, Kubernetes scaling
- trtllm-bench enables quick baseline measurement before full Triton deployment
- Speculative decoding adds 1.5-2x latency improvement for compatible workloads; requires maintaining draft + target models
SEO Elements
Squarespace Excerpt (158 characters)
TensorRT-LLM delivers 4x throughput over PyTorch with FP8 quantization and inflight batching. Complete optimization guide for NVIDIA H100 and Blackwell GPUs.
SEO Title (54 characters)
TensorRT-LLM Optimization: NVIDIA Inference Guide 2025
SEO Description (155 characters)
Master TensorRT-LLM with FP8 quantization, inflight batching, and paged KV cache. Achieve 10,000+ tokens/sec on H100 GPUs with production tuning strategies.
Title Review
Current title "TensorRT-LLM Optimization: Mastering NVIDIA's Inference Stack" runs 61 characters, slightly over the usual 60-character target. Alternatives:
- "TensorRT-LLM: Complete NVIDIA Inference Optimization Guide" (58 chars)
- "TensorRT-LLM Production Guide: FP8, Batching, KV Cache" (54 chars)
URL Slug Recommendations
Primary: tensorrt-llm-optimization-nvidia-inference-stack-guide
Alternative 1: tensorrt-llm-fp8-quantization-production-guide-2025
Alternative 2: tensorrt-llm-vs-vllm-performance-optimization-guide
Alternative 3: nvidia-tensorrt-llm-inference-tuning-complete-guide