Speculative Decoding: Achieving 2-3x LLM Inference Speedup
Updated December 11, 2025
December 2025 Update: Speculative decoding has matured from research technique to production standard. NVIDIA demonstrates 3.6x throughput improvements on H200 GPUs, and vLLM and TensorRT-LLM now include native support. Draft models propose 5-8 tokens that the target model verifies in parallel, exploiting GPU capacity left idle by single-token generation. Output quality is unchanged; latency drops 2-3x.
Large language models generate text one token at a time, and each token requires a full forward pass through billions of parameters. The sequential bottleneck creates latency that frustrates users waiting for responses, even when GPUs sit partially idle during computation. Speculative decoding breaks the bottleneck by using small, fast draft models to propose multiple tokens that larger target models verify in parallel, achieving 2-3x speedup without changing the output quality.¹
The technique has matured from research curiosity to production standard in 2025. Both vLLM and TensorRT-LLM include native speculative decoding support, with NVIDIA demonstrating 3.6x throughput improvements on H200 GPUs.² Understanding when speculative decoding helps, how to select draft models, and which frameworks offer the best implementations enables organizations to dramatically reduce inference costs and latency.
How speculative decoding works
Traditional autoregressive generation produces tokens sequentially:
- Model receives prompt, generates logits for next token
- Sample token from distribution
- Append token to context, repeat forward pass
- Continue until completion
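As a concrete illustration, the loop below sketches this sequential pattern with Hugging Face transformers, using greedy decoding for simplicity and deliberately omitting KV caching so the one-forward-pass-per-token cost is explicit. The model name is just an example.

# Minimal sketch of the sequential generation loop described above
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

ids = tok("Speculative decoding works by", return_tensors="pt").input_ids
for _ in range(32):                        # one full forward pass per new token
    logits = model(ids).logits[:, -1, :]   # next-token distribution
    next_id = torch.argmax(logits, dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)
    if next_id.item() == tok.eos_token_id:
        break
print(tok.decode(ids[0], skip_special_tokens=True))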
Each step requires the full model's computation, but GPUs have far more capacity than single-token generation utilizes. Speculative decoding exploits the unused capacity:
Draft phase: A small, fast model generates K speculative tokens quickly. The draft model might produce 5-8 candidate continuations in the time the target model takes for one token.
Verify phase: The target model processes all K tokens in a single parallel forward pass, computing probabilities for each position simultaneously. GPU parallelism enables verification of K tokens with similar cost to generating one.
Accept/reject: Compare draft and target distributions at each position. Accept tokens where distributions align; reject and resample where they diverge. The algorithm guarantees output matches exactly what the target model would produce independently.³
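For intuition, here is a minimal sketch of that accept/reject step as formulated in standard speculative sampling, assuming access to both models' next-token probability vectors at a given position. Production frameworks implement this logic inside their sampling kernels; the toy example at the end is purely illustrative.

# Accept/reject rule for one speculative position (standard speculative sampling)
import numpy as np

def accept_or_resample(draft_token, p_target, p_draft, rng):
    # Accept the drafted token with probability min(1, p_target / p_draft).
    if rng.random() < min(1.0, p_target[draft_token] / p_draft[draft_token]):
        return draft_token, True
    # On rejection, resample from the normalized residual (p_target - p_draft)+,
    # which keeps the overall output distribution identical to the target model's.
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual), False

# Toy example over a 4-token vocabulary.
rng = np.random.default_rng(0)
p_t = np.array([0.5, 0.2, 0.2, 0.1])
p_d = np.array([0.4, 0.4, 0.1, 0.1])
print(accept_or_resample(1, p_t, p_d, rng))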
The speedup comes from accepting multiple tokens per target model forward pass. If the draft model proposes 8 tokens and roughly 60% of them are accepted on average, each verification pass yields approximately 5 tokens versus 1 without speculation.
Performance benchmarks
Production deployments demonstrate substantial speedups across model families:
Llama models on vLLM:⁴
- Llama 3.1-70B with 1B draft: 2.31x speedup
- Llama 3.1-8B on single A100: 1.8x latency reduction
- Llama 3.1-70B at low request rates: 1.6x latency reduction

TensorRT-LLM on H200:⁵
- Llama 3.1-405B with varying draft models: >3x throughput
- Combined with FP8 quantization: 3.6x total improvement

SGLang with SpecForge:⁶
- Llama 4 Maverick: 2.18x speedup on MT-Bench
- Llama 4 Scout: 2.0x acceleration

EAGLE method (top performer):⁷
- Approximately 0.8 draft accuracy (80% acceptance)
- 2.5-2.8x typical speedups
- State-of-the-art on Spec-Bench leaderboard
Speedups vary significantly based on workload characteristics. Synchronous, latency-sensitive use cases see the largest gains. High-throughput batch processing benefits less as GPU compute becomes the bottleneck rather than sequential generation.
Framework implementations
vLLM speculative decoding
vLLM supports multiple speculative decoding methods including draft model, ngram matching, and EAGLE:
# Enable draft model speculation
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--speculative-draft-tensor-parallel-size 1
EAGLE integration (recommended):
# EAGLE achieves higher acceptance rates
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
--speculative-method eagle \
--num-speculative-tokens 8
vLLM's Eagle 3 integration delivers up to 2.5x speedup across diverse scenarios.⁸ The framework automatically handles token verification and rejection sampling, maintaining output equivalence with non-speculative generation.
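For batch or offline workloads, the same configuration can be expressed through vLLM's Python API. The sketch below assumes a vLLM release that exposes the speculative_model and num_speculative_tokens engine arguments (matching the CLI flags above); newer releases may group these settings into a single speculative config.

# Offline equivalent of the CLI draft-model example above
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    speculative_model="meta-llama/Llama-3.2-1B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Explain speculative decoding in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)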
TensorRT-LLM speculative decoding
TensorRT-LLM offers deeper optimization for NVIDIA hardware:
# Build engine with speculative decoding
trtllm-build \
--speculative_decoding_mode draft_tokens_external \
--max_draft_len 8 \
--checkpoint_dir $TARGET_CHECKPOINT \
--output_dir $ENGINE_DIR
For draft model configuration:
# Draft model with separate engine
trtllm-build \
--checkpoint_dir $DRAFT_CHECKPOINT \
--output_dir $DRAFT_ENGINE \
--max_batch_size 256
TensorRT-LLM's custom kernels optimize both draft generation and verification phases, extracting maximum performance from Tensor Cores and memory bandwidth.
Triton Inference Server integration
NVIDIA Triton Inference Server supports speculative decoding through vLLM backend:⁹
model_repository/
└── speculative_llm/
├── config.pbtxt
└── 1/
└── model.py
The Triton integration enables production-scale deployment with request batching, metrics collection, and Kubernetes-native scaling while maintaining speculative decoding benefits.
Draft model selection
Draft model quality determines speculative decoding effectiveness. Poor draft models waste compute on proposals the target model rejects.
Selection criteria
Architecture alignment: Draft models from the same family as targets achieve higher acceptance. Llama 3.2-1B drafting for Llama 3.1-70B outperforms generic small models because training data and tokenization align.¹⁰
Size ratio: Draft models typically range from 1/10 to 1/50 the target size. Smaller drafts generate faster but may have lower acceptance. Test multiple sizes to find the optimal ratio for your workload.
Acceptance rate threshold: Aim for 60%+ acceptance rate. Below 50%, verification overhead can negate speculation benefits. Use profiling to measure actual acceptance for your specific prompts.
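To see why acceptance rate and draft size interact, the sketch below applies the idealized analysis from the original speculative decoding work, assuming a constant per-token acceptance probability and a draft model whose forward pass costs a fixed fraction of the target's. Real workloads deviate from these assumptions, so treat the output as a rough planning estimate rather than a prediction.

# Idealized expected speedup given per-token acceptance alpha, K drafted tokens,
# and relative draft cost c (fraction of one target forward pass)
def expected_speedup(alpha: float, k: int, c: float) -> float:
    # Expected tokens emitted per verification pass: accepted drafts plus the
    # bonus token the target samples at the first rejected/final position.
    tokens_per_pass = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Relative cost per pass: k draft forward passes plus one target verify pass.
    cost_per_pass = k * c + 1
    return tokens_per_pass / cost_per_pass

# Example: 70% per-token acceptance, 5 drafted tokens, draft ~2% of target cost.
print(f"{expected_speedup(0.7, 5, 0.02):.2f}x")   # about 2.7x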
Fine-tuning draft models
Out-of-box draft models often underperform on domain-specific tasks. Fine-tuning dramatically improves acceptance:¹¹
# Fine-tune the draft model on the target model's output distribution
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Load the draft model (Llama 3.2-1B shown as an example)
draft_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# target_samples: a tokenized causal-LM dataset of completions sampled from the
# target model on representative prompts, so the draft learns the target's style
training_args = TrainingArguments(
    output_dir="./draft_finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=draft_model,
    args=training_args,
    train_dataset=target_samples,
)
trainer.train()
Organizations report 20-40% acceptance rate improvements from domain-specific draft fine-tuning. The investment pays dividends for high-volume inference workloads.
SpecForge for SGLang
SpecForge provides a purpose-built ecosystem for training draft models:¹²
- Native SGLang integration
- Optimized training recipes for Llama 4 variants
- Pre-trained speculators for common models
Red Hat's Speculators project standardizes speculative decoding with unified Hugging Face format and vLLM integration, simplifying draft model discovery and deployment.¹³
Advanced techniques
Self-speculative decoding (SWIFT)
SWIFT eliminates separate draft models by adaptively skipping intermediate layers of the target LLM:¹⁴
- No auxiliary model required
- No additional training needed
- 1.3x-1.6x speedup while preserving output distribution
The technique works by predicting which layers can be skipped based on token confidence. Simple continuations skip more layers; complex reasoning uses full model depth.
# Conceptual SWIFT configuration
config = SwiftConfig(
skip_threshold=0.8, # Skip layers when confidence > 0.8
min_layers=16, # Always use at least 16 layers
adaptive=True # Dynamically adjust per token
)
SWIFT suits scenarios where maintaining a separate draft model adds unwanted complexity.
Ngram speculation
For structured outputs or predictable patterns, ngram matching provides speculation without neural networks:
# vLLM ngram speculation
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--speculative-model "[ngram]" \
--ngram-prompt-lookup-max 4 \
--num-speculative-tokens 4
Ngram speculation identifies repeated patterns in the prompt or generation history, proposing tokens based on observed sequences. The approach works well for code generation, structured data, and repetitive content.
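The core mechanic is simple string matching over token IDs. The sketch below is a minimal illustration of prompt-lookup-style proposal, not the vLLM implementation: it finds the most recent earlier occurrence of the context's final n-gram and proposes the tokens that followed it.

# Minimal prompt-lookup ngram proposal (illustrative)
def propose_ngram(tokens: list[int], n: int = 4, k: int = 4) -> list[int]:
    # Nothing to match if the context is shorter than the lookup n-gram.
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the tail n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            # Propose up to k tokens that followed that earlier occurrence.
            return tokens[start + n:start + n + k]
    return []

# The final 3-gram [8, 2, 9] also appears earlier, so the tokens after it are proposed.
print(propose_ngram([5, 8, 2, 9, 1, 7, 5, 8, 2, 9], n=3, k=4))   # [1, 7, 5, 8]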
Medusa heads
Medusa attaches additional prediction heads to the target model, generating multiple candidate tokens in parallel:
# Conceptual Medusa usage; requires a Medusa-modified checkpoint
# (load_medusa_model is illustrative, not a specific library API)
model = load_medusa_model("path/to/medusa_llama_70b")
# Additional heads predict tokens at positions +1, +2, +3, ...
Medusa eliminates the draft model entirely but requires model modification and retraining. Organizations with custom model deployments may find Medusa worthwhile despite higher integration complexity.
When speculative decoding helps
Speculative decoding delivers strongest returns under specific conditions:
Favorable scenarios:
- Interactive chat applications prioritizing latency
- Single-user inference where GPU underutilization is high
- Long-form generation (stories, documents, code)
- Workloads with predictable token patterns

Less favorable scenarios:
- High-throughput batch processing already saturating GPU
- Very short responses (few tokens to speculate)
- Highly creative/random generation with low acceptance rates
- Memory-constrained deployments where the draft model doesn't fit
Decision framework:
IF (GPU utilization < 50% during generation)
AND (average response length > 100 tokens)
AND (draft model fits in memory)
→ Enable speculative decoding
IF (GPU utilization > 80%)
OR (memory pressure high)
→ Focus on batching optimizations instead
Infrastructure considerations
Speculative decoding introduces specific infrastructure requirements:
Memory overhead: Draft models consume additional GPU memory. Ensure sufficient headroom for:
- Draft model weights: ~1-8GB depending on size
- Additional KV cache for draft tokens
- Verification tensor allocations
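For capacity planning, a rough back-of-the-envelope estimate can be scripted. The sketch below assumes FP16 weights and a conventional per-token KV cache layout; the example dimensions approximate a Llama-3.2-1B-class draft and are illustrative rather than measured.

# Rough draft-model memory estimate (FP16 weights + KV cache), in GB
def draft_memory_gb(params_b: float, layers: int, kv_heads: int,
                    head_dim: int, max_tokens: int, batch: int) -> float:
    weights = params_b * 1e9 * 2                         # FP16 weights: 2 bytes/parameter
    kv_per_token = 2 * layers * kv_heads * head_dim * 2  # K and V caches, FP16
    kv_cache = kv_per_token * max_tokens * batch
    return (weights + kv_cache) / 1e9

# Example: ~1.2B-parameter draft (16 layers, 8 KV heads, head_dim 64),
# batch size 8 with 4K-token contexts -> roughly 3.5 GB.
print(f"{draft_memory_gb(1.2, 16, 8, 64, 4096, 8):.1f} GB")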
Compute patterns: Verification phases create bursty compute patterns different from steady autoregressive generation. Monitor GPU utilization variability and adjust batch sizes accordingly.
Draft model serving: Options include:
- Co-located: Draft runs on same GPU(s) as target
- Separate: Dedicated GPU for draft generation
- CPU-offloaded: Small drafts can run on CPU for memory savings
Organizations deploying speculative decoding at scale can leverage Introl's GPU infrastructure expertise for optimal hardware configuration and capacity planning.
Production deployment checklist
Before enabling speculative decoding in production:
1. Baseline measurement
   - Measure current latency and throughput
   - Profile GPU utilization during generation
   - Identify bottlenecks (memory, compute, communication)
2. Draft model selection
   - Test multiple draft sizes with representative prompts
   - Measure acceptance rates for your specific distribution
   - Consider fine-tuning if acceptance falls below 60%
3. Configuration tuning
   - Experiment with num_speculative_tokens (typically 4-8)
   - Balance acceptance rate against draft overhead
   - Profile memory usage with target batch sizes
4. Rollout strategy
   - A/B test against a non-speculative baseline (see the sketch after this list)
   - Monitor latency percentiles (p50, p90, p99)
   - Watch for edge cases with low acceptance
5. Ongoing optimization
   - Track acceptance rates over time
   - Retune as workload characteristics change
   - Evaluate new draft models and techniques
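For the A/B comparison in step 4, percentile reporting is straightforward once you have per-request timings from your load-testing tool. The sketch below uses synthetic placeholder latencies and an assumed 1.8-2.6x per-request gain purely so it runs standalone; substitute your measured samples.

# Compare latency percentiles for baseline vs. speculative runs
import numpy as np

rng = np.random.default_rng(0)
# Placeholder samples standing in for measured per-request latencies (milliseconds).
baseline = rng.gamma(shape=4.0, scale=300.0, size=5000)
# Assumed 1.8-2.6x per-request gain, for illustration only.
speculative = baseline / rng.uniform(1.8, 2.6, size=5000)

for name, latencies in [("baseline", baseline), ("speculative", speculative)]:
    p50, p90, p99 = np.percentile(latencies, [50, 90, 99])
    print(f"{name}: p50={p50:.0f} ms  p90={p90:.0f} ms  p99={p99:.0f} ms")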
The latency imperative
LLM inference costs dominate AI infrastructure budgets, and latency directly shapes both user experience and application design. Speculative decoding offers 2-3x speedup without accuracy tradeoffs, making the technique essential for latency-sensitive deployments.
The ecosystem's maturation in 2025 means speculative decoding has moved from experimental optimization to standard practice. vLLM, TensorRT-LLM, and SGLang all provide production-ready implementations. EAGLE and similar methods achieve acceptance rates approaching 80%, maximizing speculation benefits.
For organizations running interactive LLM applications, speculative decoding should be the first optimization evaluated after basic deployment. The technique compounds with other optimizations like quantization and batching, enabling inference systems that deliver both cost efficiency and responsive user experiences. As draft model training tools like SpecForge mature, domain-specific speculation will become increasingly accessible, pushing speedups beyond current 2-3x benchmarks toward even more dramatic improvements.
References
1. NVIDIA. "An Introduction to Speculative Decoding for Reducing Latency in AI Inference." NVIDIA Technical Blog, 2025. https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
2. NVIDIA. "TensorRT-LLM Speculative Decoding Boosts Inference Throughput by up to 3.6x." NVIDIA Technical Blog, 2025. https://developer.nvidia.com/blog/tensorrt-llm-speculative-decoding-boosts-inference-throughput-by-up-to-3-6x/
3. AMD. "Speculative Decoding - Deep Dive." ROCm Blogs, 2025. https://rocm.blogs.amd.com/software-tools-optimization/speculative-decoding---deep-dive/README.html
4. vLLM. "How Speculative Decoding Boosts vLLM Performance by up to 2.8x." vLLM Blog, October 17, 2024. https://blog.vllm.ai/2024/10/17/spec-decode.html
5. NVIDIA. "TensorRT-LLM Speculative Decoding Boosts Inference Throughput."
6. LMSYS. "SpecForge: Accelerating Speculative Decoding Training for SGLang." LMSYS Org Blog, July 25, 2025. https://lmsys.org/blog/2025-07-25-spec-forge/
7. LMSYS. "SpecForge: Accelerating Speculative Decoding Training for SGLang."
8. Red Hat Developer. "Fly Eagle(3) fly: Faster inference with vLLM & speculative decoding." July 1, 2025. https://developers.redhat.com/articles/2025/07/01/fly-eagle3-fly-faster-inference-vllm-speculative-decoding
9. NVIDIA. "Speculative Decoding with vLLM." Triton Inference Server Documentation, 2025. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tutorials/Feature_Guide/Speculative_Decoding/vLLM/README.html
10. BentoML. "Get 3× Faster LLM Inference with Speculative Decoding Using the Right Draft Model." BentoML Blog, 2025. https://www.bentoml.com/blog/3x-faster-llm-inference-with-speculative-decoding
11. BentoML. "Speculative decoding." LLM Inference Handbook, 2025. https://bentoml.com/llm/inference-optimization/speculative-decoding
12. LMSYS. "SpecForge: Accelerating Speculative Decoding Training for SGLang."
13. Red Hat Developer. "Speculators: Standardized, production-ready speculative decoding." November 19, 2025. https://developers.redhat.com/articles/2025/11/19/speculators-standardized-production-ready-speculative-decoding
14. OpenReview. "SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration." ICLR 2025. https://openreview.net/forum?id=EKJhH5D5wA
Key takeaways
For ML engineers:
- vLLM Llama 3.1-70B with 1B draft achieves 2.31x speedup; TensorRT-LLM on H200 delivers 3.6x throughput with FP8
- EAGLE achieves ~80% acceptance rate (state-of-the-art); aim for 60%+ acceptance, as below 50% the overhead may negate benefits
- Draft model selection: same architecture family achieves higher acceptance; Llama 3.2-1B drafting for Llama 3.1-70B outperforms generic small models

For infrastructure architects:
- Draft models typically 1/10 to 1/50 target size; memory overhead 1-8GB for draft weights plus KV cache
- SWIFT self-speculative decoding eliminates the separate draft model by adaptively skipping intermediate layers (1.3x-1.6x speedup)
- Ngram speculation works well for structured outputs (code, repetitive content) without neural network overhead

For operations teams:
- Optimal scenarios: interactive chat, long-form generation, GPU utilization <50%, responses >100 tokens
- Less beneficial: batch processing with GPU >80% utilization, very short responses, highly random generation
- Domain-specific draft fine-tuning improves acceptance rates 20-40%; the investment pays dividends for high-volume inference
For deployment teams:
- vLLM EAGLE integration: --speculative-method eagle --num-speculative-tokens 8 delivers up to 2.5x speedup
- TensorRT-LLM: --speculative_decoding_mode draft_tokens_external --max_draft_len 8 for external draft model support
- SpecForge (SGLang) and Red Hat Speculators standardize draft model training and deployment
SEO Elements
Squarespace Excerpt (156 characters)
Speculative decoding delivers 2-3x LLM inference speedup without quality loss. Complete guide to draft models, vLLM, TensorRT-LLM, and production deployment.
SEO Title (48 characters)
Speculative Decoding: 2-3x LLM Inference Speedup
SEO Description (154 characters)
Master speculative decoding with 2-3x inference speedup. Learn draft model selection, EAGLE integration, vLLM and TensorRT-LLM implementation for production.
Title Review
Current title "Speculative Decoding: Achieving 2-3x LLM Inference Speedup" works at 58 characters. Alternatives:
- "Speculative Decoding Guide: Faster LLM Inference 2025" (53 chars)
- "LLM Speculative Decoding: vLLM & TensorRT-LLM Guide" (51 chars)
URL Slug Recommendations
Primary: speculative-decoding-llm-inference-speedup-guide-2025
Alternative 1: speculative-decoding-draft-model-vllm-tensorrt-guide
Alternative 2: llm-speculative-decoding-eagle-implementation-2025
Alternative 3: speculative-decoding-production-inference-optimization