Speculative Decoding: Achieving 2-3x LLM Inference Speedup
Updated December 11, 2025
December 2025 Update: Speculative decoding has matured from research technique to production standard. NVIDIA demonstrates 3.6x throughput improvements on H200 GPUs, and vLLM and TensorRT-LLM now include native support. Draft models propose 5-8 tokens that the target model verifies in parallel, exploiting GPU capacity left idle by single-token generation. Output quality is unchanged; latency drops 2-3x.
Large language models generate text one token at a time, and each token requires a full forward pass through billions of parameters. The sequential bottleneck creates latency that frustrates users waiting for responses, even when GPUs sit partially idle during computation. Speculative decoding breaks the bottleneck by using small, fast draft models to propose multiple tokens that larger target models verify in parallel, achieving 2-3x speedup without changing the output quality.¹
The technique has matured from research curiosity to production standard in 2025. Both vLLM and TensorRT-LLM include native speculative decoding support, with NVIDIA demonstrating 3.6x throughput improvements on H200 GPUs.² Understanding when speculative decoding helps, how to select draft models, and which frameworks offer the best implementations enables organizations to dramatically reduce inference costs and latency.
How speculative decoding works
Traditional autoregressive generation produces tokens sequentially:
- Model receives prompt, generates logits for next token
- Sample token from distribution
- Append token to context, repeat forward pass
- Continue until completion
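As a concrete illustration, the loop below sketches this sequential pattern with Hugging Face transformers, using greedy decoding for simplicity and deliberately omitting KV caching so the one-forward-pass-per-token cost is explicit. The model name is just an example.

# Minimal sketch of the sequential generation loop described above
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

ids = tok("Speculative decoding works by", return_tensors="pt").input_ids
for _ in range(32):                        # one full forward pass per new token
    logits = model(ids).logits[:, -1, :]   # next-token distribution
    next_id = torch.argmax(logits, dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break
print(tok.decode(ids[0], skip_special_tokens=True))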
Each step requires the full model's computation, but GPUs have far more capacity than single-token generation utilizes. Speculative decoding exploits the unused capacity:
Draft phase: A small, fast model generates K speculative tokens quickly. The draft model might produce 5-8 candidate continuations in the time the target model takes for one token.
Verify phase: The target model processes all K tokens in a single parallel forward pass, computing probabilities for each position simultaneously. GPU parallelism enables verification of K tokens with similar cost to generating one.
Accept/reject: Compare draft and target distributions at each position. Accept tokens where distributions align; reject and resample where they diverge. The algorithm guarantees output matches exactly what the target model would produce independently.³
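For intuition, here is a minimal sketch of that accept/reject step as formulated in standard speculative sampling, assuming access to both models' next-token probability vectors at a given position. Production frameworks implement this logic inside their sampling kernels; the toy example at the end is purely illustrative.

# Accept/reject rule for one speculative position (standard speculative sampling)
import numpy as np

def accept_or_resample(draft_token, p_target, p_draft, rng):
    # Accept the drafted token with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[draft_token] / p_draft[draft_token]):
        return draft_token, True
    # On rejection, resample from the normalized residual (p_target - p_draft)+,
    # which keeps the overall output distribution identical to the target model's.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False

# Toy example over a 4-token vocabulary.
rng = np.random.default_rng(0)
p_t = np.array([0.5, 0.2, 0.2, 0.1])
p_d = np.array([0.4, 0.4, 0.1, 0.1])
print(accept_or_resample(1, p_t, p_d, rng))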
The speedup comes from accepting multiple tokens per target model forward pass. If the draft model proposes 8 tokens and roughly 60% of them are accepted on average, each verification pass yields approximately 5 tokens versus 1 without speculation.
Performance benchmarks
Production deployments demonstrate substantial speedups across model families:
Llama models on vLLM:⁴
- Llama 3.1-70B with 1B draft: 2.31x speedup
- Llama 3.1-8B on single A100: 1.8x latency reduction
- Llama 3.1-70B at low request rates: 1.6x latency reduction

TensorRT-LLM on H200:⁵
- Llama 3.1-405B with varying draft models: >3x throughput
- Combined with FP8 quantization: 3.6x total improvement

SGLang with SpecForge:⁶
- Llama 4 Maverick: 2.18x speedup on MT-Bench
- Llama 4 Scout: 2.0x acceleration

EAGLE method (top performer):⁷
- Approximately 0.8 draft accuracy (80% acceptance)
- 2.5-2.8x typical speedups
- State-of-the-art on Spec-Bench leaderboard
Speedups vary significantly based on workload characteristics. Synchronous, latency-sensitive use cases see the largest gains. High-throughput batch processing benefits less as GPU compute becomes the bottleneck rather than sequential generation.
Framework implementations
vLLM speculative decoding
vLLM supports multiple speculative decoding methods including draft model, ngram matching, and EAGLE:
# Enable draft model speculation
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1
EAGLE integration (recommended):
# EAGLE achieves higher acceptance rates
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
--speculative-method eagle \
--num-speculative-tokens 8
vLLM's Eagle 3 integration delivers up to 2.5x speedup across diverse scenarios.⁸ The framework automatically handles token verification and rejection sampling, maintaining output equivalence with non-speculative generation.
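For batch or offline workloads, the same configuration can be expressed through vLLM's Python API. The sketch below assumes a vLLM release that exposes the speculative_model and num_speculative_tokens engine arguments (matching the CLI flags above); newer releases may group these settings into a single speculative config.

# Offline equivalent of the CLI draft-model example above
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Explain speculative decoding in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)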
TensorRT-LLM speculative decoding
TensorRT-LLM offers deeper optimization for NVIDIA hardware:
# Build engine with speculative decoding
trtllm-build \
--speculative_decoding_mode draft_tokens_external \
--max_draft_len 8 \
--checkpoint_dir $TARGET_CHECKPOINT \
--output_dir $ENGINE_DIR
For draft model configuration:
# Draft model with separate engine
trtllm-build \
--checkpoint_dir $DRAFT_CHECKPOINT \
--output_dir $DRAFT_ENGINE \
--max_batch_size 256
TensorRT-LLM's custom kernels optimize both draft generation and verification phases, extracting maximum performance from Tensor Cores and memory bandwidth.
Triton Inference Server integration
NVIDIA Triton Inference Server supports speculative decoding through vLLM backend:⁹
model_repository/
└── speculative_llm/
├── config.pbtxt
└── 1/
└── model.py
The Triton integration enables production-scale deployment with request batching, metrics collection, and Kubernetes-native scaling while maintaining speculative decoding benefits.
Draft model selection
Draft model quality determines speculative decoding effectiveness. Poor draft models waste compute on proposals the target model rejects.
Selection criteria
Architecture alignment: Draft models from the same family as targets achieve higher acceptance. Llama 3.2-1B drafting for Llama 3.1-70B outperforms generic small models because training data and tokenization align.¹⁰
Size ratio: Draft models typically range from 1/10 to 1/50 the target size. Smaller drafts generate faster but may have lower acceptance. Test multiple sizes to find the optimal ratio for your workload.
Acceptance rate threshold: Aim for 60%+ acceptance rate. Below 50%, verification overhead can negate speculation benefits. Use profiling to measure actual acceptance for your specific prompts.
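To see why acceptance rate and draft size interact, the sketch below applies the idealized analysis from the original speculative decoding work, assuming a constant per-token acceptance probability and a draft model whose forward pass costs a fixed fraction of the target's. Real workloads deviate from these assumptions, so treat the output as a rough planning estimate rather than a prediction.

# Idealized expected speedup given per-token acceptance alpha, K drafted tokens,
# and relative draft cost c (fraction of one target forward pass)
def expected_speedup(alpha: float, k: int, c: float) -> float:
    # Expected tokens emitted per verification pass: accepted drafts plus the
    # bonus token the target samples at the first rejected/final position.
    tokens_per_pass = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Relative cost per pass: k draft forward passes plus one target verify pass.
    cost_per_pass = k * c + 1
    return tokens_per_pass / cost_per_pass

# Example: 70% per-token acceptance, 5 drafted tokens, draft ~2% of target cost.
print(f"{expected_speedup(0.7, 5, 0.02):.2f}x")   # about 2.7x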
Fine-tuning draft models
Out-of-box draft models often underperform on domain-specific tasks. Fine-tuning dramatically improves acceptance:¹¹
# Fine-tune the draft model on the target model's output distribution
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the draft model (Llama 3.2-1B shown as an example)
draft_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# target_samples: a tokenized causal-LM dataset of completions sampled from the
# target model on representative prompts, so the draft learns the target's style
training_args = TrainingArguments(
    output_dir="./draft_finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=draft_model,
    args=training_args,
    train_dataset=target_samples,
)
trainer.train()
Organizations report 20-40% acceptance rate improvements from domain-specific draft fine-tuning. The investment pays dividends for high-volume inference workloads.
SpecForge for SGLang
SpecForge provides a purpose-built ecosystem for training draft models:¹²
- Native SGLang integration
- Optimized training recipes for Llama 4 variants
- Pre-trained speculators for common models
Red Hat's Speculators project standardizes speculative decoding with unified Hugging Face format and vLLM integration, simplifying draft model discovery and deployment.¹³
Advanced techniques
Self-speculative decoding (SWIFT)
SWIFT eliminates separate draft models by adaptively skipping intermediate layers of the target LLM:¹⁴
- No auxiliary model required
- No additional training needed
- 1.3x-1.6x speedup while preserving output distribution
The technique works by predicting which layers can be skipped based on token confidence. Simple continuations skip more layers; complex reasoning uses full model depth.
# Conceptual SWIFT configuration
config = SwiftConfig(
skip_threshold=0.8, # Skip layers when confidence > 0.8
min_layers=16, # Always use at least 16 layers
adaptive=True # Dynamically adjust per token
)
SWIFT suits scenarios where maintaining a separate draft model adds unwanted complexity.
Ngram speculation
For structured outputs or predictable patterns, ngram matching provides speculation without neural networks:
# vLLM ngram speculation
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model "[ngram]" \
--ngram-prompt-lookup-max 4 \
--num-speculative-tokens 4
Ngram speculation identifies repeated patterns in the prompt or generation history, proposing tokens based on observed sequences. The approach works well for code generation, structured data, and repetitive content.
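The core mechanic is simple string matching over token IDs. The sketch below is a minimal illustration of prompt-lookup-style proposal, not the vLLM implementation: it finds the most recent earlier occurrence of the context's final n-gram and proposes the tokens that followed it.

# Minimal prompt-lookup ngram proposal (illustrative)
def propose_ngram(tokens: list[int], n: int = 4, k: int = 4) -> list[int]:
    # Nothing to match if the context is shorter than the lookup n-gram.
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the tail n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            # Propose up to k tokens that followed that earlier occurrence.
            return tokens[start + n:start + n + k]
    return []

# The final 3-gram [8, 2, 9] also appears earlier, so the tokens after it are proposed.
print(propose_ngram([5, 8, 2, 9, 1, 7, 5, 8, 2, 9], n=3, k=4))   # [1, 7, 5, 8]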
Medusa heads
Medusa attaches additional prediction heads to the target model, generating multiple candidate tokens in parallel:
# Conceptual Medusa usage; requires a Medusa-modified checkpoint
# (load_medusa_model is illustrative, not a specific library API)
model = load_medusa_model("path/to/medusa_llama_70b")
# Additional heads predict tokens at positions +1, +2, +3, ...
Medusa eliminates the draft model entirely but requires model modification and retraining. Organizations with custom model deployments may find Medusa worthwhile despite higher integration complexity.
When speculative decoding helps
Speculative decoding delivers strongest returns under specific conditions:
Favorable scenarios:
- Interactive chat applications prioritizing latency
- Single-user inference where GPU underutilization is high
- Long-form generation (stories, documents, code)
- Workloads with predictable token patterns

Less favorable scenarios:
- High-throughput batch processing already saturating GPU
- Very short responses (few tokens to speculate)
- Highly creative/random generation with low acceptance rates
- Memory-constrained deployments where the draft model doesn't fit
Decision framework:
IF (GPU utilization < 50% during generation)
AND (average response length > 100 tokens)
AND (draft model fits in memory)
→ Enable speculative decoding
IF (GPU utilization > 80%)
OR (memory pressure high)
→ Focus on batching optimizations instead
Infrastructure considerations
Speculative decoding introduces specific infrastructure requirements:
Memory overhead: Draft models consume additional GPU memory. Ensure sufficient headroom for:
- Draft model weights: ~1-8GB depending on size
- Additional KV cache for draft tokens
- Verification tensor allocations
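For capacity planning, a rough back-of-the-envelope estimate can be scripted. The sketch below assumes FP16 weights and a conventional per-token KV cache layout; the example dimensions approximate a Llama-3.2-1B-class draft and are illustrative rather than measured.

# Rough draft-model memory estimate (FP16 weights + KV cache), in GB
def draft_memory_gb(params_b: float, layers: int, kv_heads: int,
                    head_dim: int, max_tokens: int, batch: int) -> float:
    weights = params_b * 1e9 * 2                         # FP16 weights: 2 bytes/parameter
    kv_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V caches, FP16
    kv_cache = kv_per_token * max_tokens * batch
    return (weights + kv_cache) / 1e9

# Example: ~1.2B-parameter draft (16 layers, 8 KV heads, head_dim 64),
# batch size 8 with 4K-token contexts -> roughly 3.5 GB.
print(f"{draft_memory_gb(1.2, 16, 8, 64, 4096, 8):.1f} GB")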
Compute patterns: Verification phases create bursty compute patterns different from steady autoregressive generation. Monitor GPU utilization variability and adjust batch sizes accordingly.
Draft model serving: Options include:
- Co-located: Draft runs on same GPU(s) as target
- Separate: Dedicated GPU for draft generation
- CPU-offloaded: Small drafts can run on CPU for memory savings
Organizations deploying speculative decoding at scale can leverage Introl's GPU infrastructure expertise for optimal hardware configuration and capacity planning.
Production deployment checklist
Before enabling speculative decoding in production:
1. Baseline measurement
   - Measure current latency and throughput
   - Profile GPU utilization during generation
   - Identify bottlenecks (memory, compute, communication)
2. Draft model selection
   - Test multiple draft sizes with representative prompts
   - Measure acceptance rates for your specific distribution
   - Consider fine-tuning if acceptance falls below 60%
3. Configuration tuning
   - Experiment with num_speculative_tokens (typically 4-8)
   - Balance acceptance rate against draft overhead
   - Profile memory usage with target batch sizes
4. Rollout strategy
   - A/B test against a non-speculative baseline (see the sketch after this list)
   - Monitor latency percentiles (p50, p90, p99)
   - Watch for edge cases with low acceptance
5. Ongoing optimization
   - Track acceptance rates over time
   - Retune as workload characteristics change
   - Evaluate new draft models and techniques
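For the A/B comparison in step 4, percentile reporting is straightforward once you have per-request timings from your load-testing tool. The sketch below uses synthetic placeholder latencies and an assumed 1.8-2.6x per-request gain purely so it runs standalone; substitute your measured samples.

# Compare latency percentiles for baseline vs. speculative runs
import numpy as np

rng = np.random.default_rng(0)
# Placeholder samples standing in for measured per-request latencies (milliseconds).
baseline = rng.gamma(shape=4.0, scale=300.0, size=5000)
# Assumed 1.8-2.6x per-request gain, for illustration only.
speculative = baseline / rng.uniform(1.8, 2.6, size=5000)

for name, latencies in [("baseline", baseline), ("speculative", speculative)]:
    p50, p90, p99 = np.percentile(latencies, [50, 90, 99])
    print(f"{name}: p50={p50:.0f} ms  p90={p90:.0f} ms  p99={p99:.0f} ms")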
The latency imperative
LLM inference costs dominate AI infrastructure budgets, and latency directly shapes both user experience and application design. Speculative decoding offers 2-3x speedup without accuracy tradeoffs, making the technique essential for latency-sensitive deployments.
The ecosystem's maturation in 2025 means speculative decoding has moved from experimental optimization to standard practice. vLLM, TensorRT-LLM, and SGLang all provide production-ready implementations. EAGLE and similar methods achieve acceptance rates approaching 80%, maximizing speculation benefits.
For organizations running interactive LLM applications, speculative decoding should be the first optimization evaluated after basic deployment. The technique compounds with other optimizations like quantization and batching, enabling inference systems that deliver both cost efficiency and responsive user experiences. As draft model training tools like SpecForge mature, domain-specific speculation will become increasingly accessible, pushing speedups beyond current 2-3x benchmarks toward even more dramatic improvements.
References
1. NVIDIA. "An Introduction to Speculative Decoding for Reducing Latency in AI Inference." NVIDIA Technical Blog, 2025. https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
2. NVIDIA. "TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x." NVIDIA Technical Blog, 2025. https://developer.nvidia.com/blog/tensorrt-llm-speculative-decoding-boosts-inference-throughput-by-up-to-3-6x/
3. AMD. "Speculative Decoding - Deep Dive." ROCm Blogs, 2025. https://rocm.blogs.amd.com/software-tools-optimization/speculative-decoding---deep-dive/README.html
4. vLLM. "How Speculative Decoding Boosts vLLM Performance by up to 2.8x." vLLM Blog, October 17, 2024. https://blog.vllm.ai/2024/10/17/spec-decode.html
5. NVIDIA. "TensorRT-LLM Speculative Decoding Boosts Inference Throughput."
6. LMSYS. "SpecForge: Accelerating Speculative Decoding Training for SGLang." LMSYS Org Blog, July 25, 2025. https://lmsys.org/blog/2025-07-25-spec-forge/
7. LMSYS. "SpecForge: Accelerating Speculative Decoding Training for SGLang."
8. Red Hat Developer. "Fly Eagle(3) fly: Faster inference with vLLM & speculative decoding." July 1, 2025. https://developers.redhat.com/articles/2025/07/01/fly-eagle3-fly-faster-inference-vllm-speculative-decoding
9. NVIDIA. "Speculative Decoding with vLLM." Triton Inference Server Documentation, 2025. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.html
10. BentoML. "Get 3× Faster LLM Inference with Speculative Decoding Using the Right Draft Model." BentoML Blog, 2025. https://www.bentoml.com/blog/3x-faster-llm-inference-with-speculative-decoding
11. BentoML. "Speculative decoding." LLM Inference Handbook, 2025. https://bentoml.com/llm/inference-optimization/speculative-decoding
12. LMSYS. "SpecForge: Accelerating Speculative Decoding Training for SGLang."
13. Red Hat Developer. "Speculators: Standardized, production-ready speculative decoding." November 19, 2025. https://developers.redhat.com/articles/2025/11/19/speculators-standardized-production-ready-speculative-decoding
14. OpenReview. "SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration." ICLR 2025. https://openreview.net/forum?id=EKJhH5D5wA
Key takeaways
For ML engineers:
- vLLM Llama 3.1-70B with 1B draft achieves 2.31x speedup; TensorRT-LLM on H200 delivers 3.6x throughput with FP8
- EAGLE achieves ~80% acceptance rate (state-of-the-art); aim for 60%+ acceptance, as below 50% the overhead may negate benefits
- Draft model selection: same architecture family achieves higher acceptance; Llama 3.2-1B drafting for Llama 3.1-70B outperforms generic small models

For infrastructure architects:
- Draft models typically 1/10 to 1/50 target size; memory overhead 1-8GB for draft weights plus KV cache
- SWIFT self-speculative decoding eliminates the separate draft model by adaptively skipping intermediate layers (1.3x-1.6x speedup)
- Ngram speculation works well for structured outputs (code, repetitive content) without neural network overhead

For operations teams:
- Optimal scenarios: interactive chat, long-form generation, GPU utilization <50%, responses >100 tokens
- Less beneficial: batch processing with GPU >80% utilization, very short responses, highly random generation
- Domain-specific draft fine-tuning improves acceptance rates 20-40%; the investment pays dividends for high-volume inference
For deployment teams:
- vLLM EAGLE integration: --speculative-method eagle --num-speculative-tokens 8 delivers up to 2.5x speedup
- TensorRT-LLM: --speculative_decoding_mode draft_tokens_external --max_draft_len 8 for external draft model support
- SpecForge (SGLang) and Red Hat Speculators standardize draft model training and deployment
SEO Elements
Squarespace Excerpt (156 characters)
Speculative decoding delivers 2-3x LLM inference speedup without quality loss. Complete guide to draft models, vLLM, TensorRT-LLM, and production deployment.
SEO Title (48 characters)
Speculative Decoding: 2-3x LLM Inference Speedup
SEO Description (154 characters)
Master speculative decoding with 2-3x inference speedup. Learn draft model selection, EAGLE integration, vLLM and TensorRT-LLM implementation for production.
Title Review
Current title "Speculative Decoding: Achieving 2-3x LLM Inference Speedup" works at 58 characters. Alternatives:
- "Speculative Decoding Guide: Faster LLM Inference 2025" (53 chars)
- "LLM Speculative Decoding: vLLM & TensorRT-LLM Guide" (51 chars)
URL Slug Recommendations
Primary: speculative-decoding-llm-inference-speedup-guide-2025
Alternative 1: speculative-decoding-draft-model-vllm-tensorrt-guide
Alternative 2: llm-speculative-decoding-eagle-implementation-2025
Alternative 3: speculative-decoding-production-inference-optimization