Prompt Caching Infrastructure: Reducing LLM Costs and Latency
Updated December 11, 2025
December 2025 Update: Anthropic prefix caching delivering 90% cost reduction and 85% latency reduction for long prompts. OpenAI automatic caching enabled by default (50% cost savings). 31% of LLM queries exhibiting semantic similarity—massive inefficiency without caching. Cache reads at $0.30/M tokens vs $3.00/M fresh (Anthropic). Multi-tier caching architecture (semantic → prefix → inference) maximizing savings.
Anthropic's prompt caching reduces costs by up to 90% and latency by up to 85% for long prompts.¹ OpenAI achieves 50% cost reduction with automatic caching enabled by default. Research shows 31% of LLM queries exhibit semantic similarity to previous requests, representing massive inefficiency in deployments without caching infrastructure.² Organizations running production AI applications leave substantial money on the table without proper caching strategies.
Prompt caching operates at multiple levels—from provider-side prefix caching that reuses KV cache computations, to application-level semantic caching that returns previous responses for similar queries. Understanding each layer and when to deploy it helps organizations optimize both cost and latency for their specific workload patterns.
Caching fundamentals
LLM inference costs derive from two sources: processing input tokens and generating output tokens. Caching strategies target both:
Input token caching (prefix caching)
Every LLM request processes input tokens through the model's attention mechanism, generating key-value pairs stored in KV cache. When multiple requests share identical prefixes—system prompts, few-shot examples, or document context—the KV cache computation repeats unnecessarily.
Prefix caching solution: Store computed KV values for common prefixes. Subsequent requests with matching prefixes skip recomputation, starting from cached state.
Cost impact:
- Anthropic: Cache reads cost $0.30/M tokens vs. $3.00/M for fresh processing (90% savings)
- OpenAI: 50% discount for cached tokens
- Google: Variable pricing based on context window
Latency impact: Skipping prefix computation reduces time-to-first-token by 50-85% depending on prefix length.
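As a rough illustration at Anthropic's published rates, consider a request carrying an 8,000-token shared prefix:
Uncached prefix: 8,000 tokens × $3.00/M = $0.024 per request
Cached prefix: 8,000 tokens × $0.30/M = $0.0024 per request
Savings on the prefix portion: 90% per cache hit, before any output-token costs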
Output caching (semantic caching)
Some requests deserve identical responses—repeated questions, deterministic queries, or lookups that don't require regeneration.
Semantic caching solution: Store response outputs keyed by semantically similar inputs. Return cached responses without LLM invocation for matching queries.
Cost impact: Cached responses eliminate API calls entirely—100% savings on cache hits.
Latency impact: Response returns in milliseconds versus seconds for LLM inference.
Caching hierarchy
Production systems typically implement multiple caching layers:
Request → Semantic cache (hit: return cached response, 100% savings) → Prefix cache (hit: reuse cached KV state, 50-90% savings) → Full inference (fresh computation)
Each layer captures different optimization opportunities based on request similarity patterns.
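A minimal sketch of that routing logic in application code. The helpers semantic_lookup, semantic_store, and call_llm_with_prefix_caching are hypothetical stand-ins for a semantic cache (such as GPTCache) and a provider client with prefix caching enabled:
from typing import Callable, Optional

def answer(
    query: str,
    semantic_lookup: Callable[[str], Optional[str]],      # hypothetical: returns a cached response or None
    semantic_store: Callable[[str, str], None],           # hypothetical: stores query -> response
    call_llm_with_prefix_caching: Callable[[str], str],   # hypothetical: provider call; prefix cache applies server-side
) -> str:
    # Tier 1: semantic cache. A hit returns a stored response and skips the API entirely.
    cached = semantic_lookup(query)
    if cached is not None:
        return cached

    # Tier 2/3: call the provider. If the prompt shares a prefix with a recent
    # request, the provider's prefix cache reuses the stored KV state; otherwise
    # the request falls through to full inference.
    response = call_llm_with_prefix_caching(query)
    semantic_store(query, response)
    return response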
Provider-level prompt caching
Anthropic Claude
Anthropic offers the most configurable prompt caching:³
Pricing:
- Cache writes: 25% premium over base input price
- Cache reads: 90% discount (10% of base price)
- Break-even: typically within one or two cache hits per cached prefix (see Cost modeling below)
Requirements:
- Minimum 1,024 tokens per cache checkpoint
- Up to 4 cache checkpoints per request
- Cache lifetime: 5 minutes, refreshed on each cache hit (a 1-hour TTL is available at a higher write price)
- Up to 5 conversation turns cacheable
Implementation:
import anthropic

client = anthropic.Anthropic()

# Mark content for caching with cache_control
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert assistant for our enterprise software...",
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        }
    ],
    messages=[{"role": "user", "content": "How do I configure user permissions?"}],
)
Best practices:
- Place static content (system prompts, documentation) at the beginning of the prompt
- Place dynamic content (user input, conversation history) at the end
- Use cache checkpoints at natural boundaries
- Monitor cache hit rates to verify the optimization (see the snippet below)
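To confirm the markers are taking effect, inspect the usage block of the response: the Anthropic API reports cache writes and cache reads separately from regular input tokens. A minimal check, reusing the response object from the example above:
# Cache accounting for the request above, from the Anthropic usage block
usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Uncached input tokens: {usage.input_tokens}")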
OpenAI
OpenAI implements automatic caching without code changes:⁴
Pricing:
- Cached tokens: 50% of base input price
- No cache write premium
Requirements:
- Minimum 1,024 tokens for caching eligibility
- Cache hits occur in 128-token increments
- Cache lifetime: 5-10 minutes of inactivity
Automatic behavior:
- Prompts exceeding 1,024 tokens are cached automatically
- The system detects matching prefixes across requests
- No API changes required
Monitoring:
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
)

# Check usage for cache hits
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
print(f"Total input tokens: {response.usage.prompt_tokens}")
Google Gemini
Google provides context caching for Gemini models:⁵
Pricing:
- Variable based on cached context size and duration
- Storage fees for cached content
Features:
- Explicit cache creation and management
- Configurable time-to-live
- Cross-request cache sharing
Implementation:
import datetime

import google.generativeai as genai
from google.generativeai import caching

# Create cached content (product_docs is assumed to be loaded elsewhere)
cache = caching.CachedContent.create(
    model='models/gemini-1.5-pro-001',
    display_name='product-documentation',
    system_instruction="You are a product expert...",
    contents=[product_docs],
    ttl=datetime.timedelta(hours=1),
)

# Use cached content in requests
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
response = model.generate_content("How do I configure feature X?")
Amazon Bedrock
AWS offers prompt caching in preview for supported models:⁶
Requirements:
- Claude 3.5 Sonnet requires a 1,024-token minimum per checkpoint
- A second checkpoint requires 2,048 tokens
Implementation pattern matches Anthropic's cache_control approach within Bedrock's API structure.
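A rough sketch of what that looks like with the Bedrock Converse API, where a cachePoint block marks the end of the cacheable prefix; model ID, region availability, and exact limits should be checked against the current Bedrock documentation:
import boto3

# Minimal sketch: mark the system prompt as a cacheable prefix via a cachePoint block
client = boto3.client("bedrock-runtime")

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[
        {"text": "You are an expert assistant for our enterprise software..."},
        {"cachePoint": {"type": "default"}},  # everything above this point is cacheable
    ],
    messages=[
        {"role": "user", "content": [{"text": "How do I configure user permissions?"}]}
    ],
)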
vLLM prefix caching
Self-hosted inference with vLLM includes automatic prefix caching:⁷
Architecture
vLLM's Automatic Prefix Caching (APC) stores KV blocks in a hash table, enabling cache reuse without tree structures:
Key design:
- All KV blocks stored in a block pool at initialization
- Hash-based lookup for prefix matching
- O(1) operations for block management
- PagedAttention memory efficiency maintained
Configuration
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # Enable APC
    gpu_memory_utilization=0.90,
)
Performance impact
vLLM with PagedAttention demonstrates 14-24x higher throughput than naive implementations.⁸ Prefix caching adds:
- Roughly 10x lower compute cost for cached versus uncached prefix tokens
- Order-of-magnitude latency reduction for matching prefixes
- Memory efficiency through shared KV blocks
Security considerations
vLLM supports cache isolation for shared environments:
from vllm import SamplingParams

# Per-request cache salt prevents cross-tenant cache access
response = llm.generate(
    prompt="...",
    sampling_params=SamplingParams(...),
    cache_salt="tenant-123",  # Isolate cache by tenant
)
Cache salt injection into block hashes prevents timing attacks where adversaries infer cached content through latency observation.
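Conceptually, the salt is simply mixed into each block's hash key, so identical token blocks from different tenants never map to the same cache entry. The sketch below illustrates the idea only; it is not vLLM's actual hashing code:
import hashlib

def block_hash(parent_hash: str, block_tokens: list[int], cache_salt: str) -> str:
    # Conceptual illustration: chain the parent block's hash, the block's token IDs,
    # and the per-request salt. Two tenants with the same prefix then produce
    # different keys, so neither can observe the other's cache hits via latency.
    payload = f"{cache_salt}|{parent_hash}|{','.join(map(str, block_tokens))}"
    return hashlib.sha256(payload.encode()).hexdigest()

# Same tokens, different tenants -> different cache keys
print(block_hash("", [101, 2023, 2003], cache_salt="tenant-123"))
print(block_hash("", [101, 2023, 2003], cache_salt="tenant-456"))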
LMCache extension
LMCache extends vLLM with advanced caching capabilities:⁹
Features:
- KV cache reuse across engine instances
- Multi-tier storage (GPU → CPU RAM → disk)
- Non-prefix content caching
- 3-10x latency reduction in benchmarks
Architecture:
vLLM Engine → LMCache → GPU VRAM (hot) → CPU RAM (warm) → Local Disk (cold)
Semantic caching
Semantic caching returns previous responses for semantically similar (not just identical) queries:
GPTCache
GPTCache provides open-source semantic caching for LLM applications:¹⁰
Architecture:
Query → Embedding (BERT/OpenAI) → Vector search (Milvus/FAISS) → Similarity check (threshold ≈ 0.8) → Cached response or API call
Components:
- LLM Adapter: integration with various LLM providers
- Embedding Generator: query vectorization
- Vector Store: similarity search (Milvus, FAISS, Zilliz)
- Cache Manager: storage and retrieval
- Similarity Evaluator: threshold-based matching
Implementation:
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize semantic cache: embed queries with a local ONNX model,
# store vectors in FAISS, and match on embedding distance
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# Use cached OpenAI calls through the GPTCache adapter
response = openai.ChatCompletion.create(
    model='gpt-4',
    messages=[{"role": "user", "content": "What is machine learning?"}],
)

# Semantically similar queries ("Explain ML", "Define machine learning")
# return the cached response
Performance
GPTCache achieves significant efficiency gains:¹¹
- API call reduction: Up to 68.8% across query categories
- Cache hit rates: 61.6% to 68.8%
- Accuracy: 97%+ positive hit rate
- Latency reduction: 40-50% on cache hits, up to 100x for full hits
Advanced techniques
VectorQ adaptive thresholds:¹²
Static similarity thresholds (e.g., 0.8) perform poorly across diverse queries. VectorQ learns embedding-specific threshold regions that adapt to query complexity:
- Simple factual queries: Higher thresholds (stricter matching)
- Open-ended queries: Lower thresholds (more reuse)
- Ambiguous queries: Dynamic adjustment
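A simplified illustration of the idea, not the VectorQ algorithm itself: keep a per-category threshold and only serve the cache when similarity clears that bar. Category names and threshold values here are hypothetical:
import numpy as np

# Illustrative per-category thresholds (hypothetical values)
THRESHOLDS = {"factual": 0.92, "open_ended": 0.80, "ambiguous": 0.86}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_serve_from_cache(query_vec: np.ndarray, cached_vec: np.ndarray, category: str) -> bool:
    # Stricter matching for factual lookups, looser for open-ended prompts
    threshold = THRESHOLDS.get(category, 0.85)
    return cosine_similarity(query_vec, cached_vec) >= threshold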
SCALM pattern detection:
SCALM improves on GPTCache through pattern detection and frequency analysis:
- 63% improvement in cache hit ratio
- 77% reduction in token usage
- Identifies high-frequency cache entry patterns
When to use semantic caching
Good candidates:
- FAQ-style queries with a limited answer space
- Lookup queries (product info, documentation)
- Deterministic responses (calculations, formatting)
- High-traffic applications with query repetition
Poor candidates:
- Creative generation requiring uniqueness
- Personalized responses (user-specific context)
- Time-sensitive information
- Low-repetition query patterns
Implementation patterns
Chat applications
Chat systems benefit from both prefix and semantic caching:
System prompt caching:
# Static system prompt cached at request start
system_prompt = """
You are a customer support agent for Acme Corp...
[2000+ tokens of guidelines and knowledge]
"""

# Dynamic conversation appended after the cached prefix
messages = [
    {"role": "system", "content": system_prompt, "cache_control": {...}},
    {"role": "user", "content": user_message},
]
Conversation history caching: Anthropic supports caching up to 5 conversation turns, reducing cost for multi-turn conversations.
RAG applications
Retrieval-augmented generation caches retrieved context:
# Cache structure for RAG
cached_context = {
    "system": system_prompt,        # Always cached
    "documents": retrieved_chunks,  # Cache per query cluster
    "examples": few_shot_examples,  # Stable across requests
}

# Only the user query varies
dynamic_content = {
    "query": user_question,
}
Document chunk caching: When multiple queries retrieve the same documents, prefix caching eliminates redundant processing of shared context.
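One practical detail is to assemble the prompt deterministically, so that two queries retrieving the same documents produce byte-identical prefixes. A minimal sketch with hypothetical helper and field names:
def build_rag_prompt(system_prompt: str,
                     few_shot_examples: str,
                     retrieved_chunks: list[dict],
                     user_question: str) -> str:
    # Sort chunks by a stable key (e.g., document ID) so the same retrieval set
    # always yields the same prefix text, maximizing prefix-cache hits.
    ordered = sorted(retrieved_chunks, key=lambda c: c["doc_id"])
    context = "\n\n".join(chunk["text"] for chunk in ordered)

    # Static and shared content first (cacheable), user question last (dynamic)
    return (
        f"{system_prompt}\n\n"
        f"{few_shot_examples}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_question}"
    )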
Agentic workflows
Agent systems with tool calling benefit from prefix caching:
System prompt (cached) → Tool definitions (cached) → Conversation history (partially cached) → Current query (dynamic)
Each tool invocation that preserves prompt structure hits prefix cache, reducing cost of multi-step reasoning.
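With Anthropic's API, for example, the stable portions can be marked explicitly. The sketch below uses a hypothetical search_tickets tool with an abbreviated schema; cache_control on the last tool definition covers the tool list, and the system prompt carries its own marker, so each step of the agent loop reuses the same cached prefix:
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_tickets",  # hypothetical tool
        "description": "Search the support ticket database...",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
        "cache_control": {"type": "ephemeral"},  # caches all tool definitions up to this point
    }
]

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    system=[{"type": "text",
             "text": "You are a support agent for Acme Corp...",
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "Summarize the open tickets for Acme Corp."}],
)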
Monitoring and optimization
Key metrics
Track cache effectiveness through:
Cache hit rate:
Hit Rate = Cache Hits / Total Requests
Target: 30-60% for semantic caching, 80%+ for prefix caching with stable prompts
Cost savings:
Savings = (Full Price - Cached Price) × Cached Tokens
Latency reduction:
Latency Improvement = (Uncached TTFT - Cached TTFT) / Uncached TTFT
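These roll up naturally from request logs. A minimal sketch, assuming each logged request records whether it hit the cache, how many tokens were cached, and its time-to-first-token:
def cache_metrics(requests: list[dict],
                  full_price_per_m: float = 3.00,
                  cached_price_per_m: float = 0.30) -> dict:
    """Each request dict is assumed to carry: cache_hit (bool), cached_tokens (int), ttft_s (float)."""
    hits = [r for r in requests if r["cache_hit"]]
    misses = [r for r in requests if not r["cache_hit"]]

    hit_rate = len(hits) / len(requests)
    cached_tokens = sum(r["cached_tokens"] for r in hits)
    savings = (full_price_per_m - cached_price_per_m) * cached_tokens / 1_000_000

    uncached_ttft = sum(r["ttft_s"] for r in misses) / len(misses) if misses else 0.0
    cached_ttft = sum(r["ttft_s"] for r in hits) / len(hits) if hits else 0.0
    latency_improvement = (uncached_ttft - cached_ttft) / uncached_ttft if uncached_ttft else 0.0

    return {"hit_rate": hit_rate, "savings_usd": savings, "latency_improvement": latency_improvement}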
Provider dashboards
Most providers surface cache metrics:
- Anthropic: Cache read/write tokens in usage API
- OpenAI: cached_tokens in usage response
- vLLM: Prometheus metrics for cache utilization
Optimization strategies
Maximize prefix overlap:
- Structure prompts with static content first
- Standardize system prompts across similar requests
- Batch similar queries to maximize prefix sharing
Tune semantic thresholds:
- Start conservative (0.85+) and lower based on observed accuracy
- Monitor false positive rates (incorrect cache returns)
- Segment thresholds by query category
Manage cache lifecycle:
- Extend cache TTL for high-value prefixes
- Pre-warm caches for predictable traffic patterns
- Invalidate caches when underlying data changes
Cost modeling
Prefix caching economics
Calculate break-even for prefix caching investment:
Anthropic example:
Base input cost: $3.00/M tokens
Cache write cost: $3.75/M tokens (25% premium, i.e., $0.75/M more than uncached processing)
Cache read cost: $0.30/M tokens (90% discount, saving $2.70/M per read)
Break-even:
The $0.75/M write premium is recovered by the first cache read, so a prefix that is read even once after being written pays for itself. Counting the full $3.75/M write cost against read savings instead gives $3.75 ÷ $2.70 ≈ 1.4 reads per cached prefix.
ROI at 10 reads:
Savings = 10 × $2.70 - $0.75 = $26.25 per million cached tokens
OpenAI example:
Base input cost: $2.50/M tokens (GPT-4o)
Cached cost: $1.25/M tokens (50% discount)
No write premium
ROI at any cache hit:
Savings = $1.25 per million cached tokens
Semantic caching economics
Calculate value of full response caching:
API cost per request: $0.05 (average)
Cache infrastructure cost: $0.001 per cached response per day
Cache hit rate: 50%
Daily requests: 100,000
Without caching: 100,000 × $0.05 = $5,000
With caching: 50,000 × $0.05 + 50,000 × $0 + $0.001 × 50,000 = $2,550
Daily savings: $2,450 (49%)
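The same arithmetic as a small script, convenient for plugging in your own traffic, pricing, and hit-rate assumptions:
def daily_semantic_cache_savings(daily_requests: int = 100_000,
                                 api_cost_per_request: float = 0.05,
                                 hit_rate: float = 0.50,
                                 cache_cost_per_hit: float = 0.001) -> float:
    # Baseline: every request pays the full API cost
    baseline = daily_requests * api_cost_per_request
    # With caching: misses still call the API; hits pay only cache infrastructure cost
    hits = daily_requests * hit_rate
    misses = daily_requests - hits
    with_cache = misses * api_cost_per_request + hits * cache_cost_per_hit
    return baseline - with_cache

print(f"Daily savings: ${daily_semantic_cache_savings():,.0f}")  # ≈ $2,450 with the assumptions above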
Organizations optimizing LLM inference costs can leverage Introl's infrastructure expertise for deployment planning across global locations.
The caching imperative
Prompt caching represents one of the highest-impact optimizations for production LLM applications. Provider-level prefix caching delivers 50-90% cost reduction with minimal implementation effort—often automatic. Semantic caching adds another layer, eliminating API calls entirely for repetitive queries.
The optimization compounds. A chat application with stable system prompts, consistent document retrieval, and repetitive user questions might cache 70%+ of input tokens through prefix caching while semantic caching handles 30% of queries outright. Combined savings can exceed 80% versus naive implementation.
Implementation priority should follow impact:
- Enable provider caching: the most direct path to savings. OpenAI caching is automatic; Anthropic requires cache_control markers.
- Optimize prompt structure: move static content to the prefix and dynamic content to the suffix; maximize prefix overlap across requests.
- Add semantic caching: for high-traffic applications with query repetition, using GPTCache or a custom implementation.
- Tune and monitor: track hit rates, adjust thresholds, and invalidate stale caches.
The 31% of queries showing semantic similarity represents billions of dollars in wasted LLM inference across the industry. Organizations that implement proper caching capture those savings while simultaneously improving latency for their users. The infrastructure exists. The economics are clear. The only question is implementation priority.
Key Takeaways
For ML engineers:
- Anthropic prefix caching: 90% cost reduction at $0.30/M tokens vs. $3.00/M base; the 25% write premium is recovered within the first one or two reads per cached prefix
- OpenAI automatic caching: 50% discount, no code changes needed for prompts exceeding 1,024 tokens
- vLLM PagedAttention delivers 14-24x higher throughput than naive implementations; APC adds roughly 10x lower cost for cached prefix tokens
- 31% of LLM queries show semantic similarity—massive inefficiency without caching infrastructure
For infrastructure architects:
- Multi-tier caching architecture: semantic cache (100% savings) → prefix cache (50-90% savings) → full inference
- vLLM cache isolation via cache_salt parameter prevents cross-tenant cache access in shared environments
- LMCache extends vLLM with GPU → CPU RAM → disk tiering for 3-10x latency reduction
- Anthropic supports caching up to 5 conversation turns—structure multi-turn prompts for maximum cache reuse
For platform teams:
- GPTCache achieves 61.6-68.8% cache hit rates with 97%+ positive hit accuracy
- Static semantic thresholds (0.8) perform poorly—VectorQ adaptive thresholds improve accuracy across diverse queries
- SCALM pattern detection: 63% improvement in cache hit ratio, 77% reduction in token usage
- Target 30-60% hit rates for semantic caching, 80%+ for prefix caching with stable prompts
For finance teams:
- Combined caching strategies can exceed 80% cost reduction vs. naive implementation
- Calculate break-even: Anthropic recovers its 25% write premium within one or two cache hits per prefix; OpenAI has no write premium and saves on the first hit
- Semantic caching ROI: 50% hit rate on 100K daily requests saves $2,450/day at $0.05/request average
- Cache infrastructure costs ~$0.001 per cached response per day
For operations teams:
- Monitor cache_read/cache_write tokens in the Anthropic usage API; cached_tokens in OpenAI responses
- Pre-warm caches for predictable traffic patterns; invalidate when underlying data changes
- RAG applications: cache retrieved document chunks—multiple queries hitting the same documents maximize savings
- Agentic workflows: tool definitions and system prompts hit the prefix cache across multi-step reasoning
References
1. Anthropic. "Prompt caching with Claude." August 2024. https://www.anthropic.com/news/prompt-caching
2. arXiv. "GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching." 2024. https://arxiv.org/abs/2411.05276
3. Anthropic. "Prompt caching with Claude."
4. LLMindset. "OpenAI Prompt Caching." October 2024. https://llmindset.co.uk/posts/2024/10/openai-prompt-caching/
5. Phase 2. "Optimizing LLM Costs: A Comprehensive Analysis of Context Caching Strategies." April 28, 2025. https://phase2online.com/2025/04/28/optimizing-llm-costs-with-context-caching/
6. Amazon Web Services. "Prompt caching for faster model inference - Amazon Bedrock." 2025. https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
7. vLLM Documentation. "Automatic Prefix Caching." 2025. https://docs.vllm.ai/en/latest/design/prefix_caching/
8. llm-d. "KV-Cache Wins You Can See: From Prefix Caching in vLLM to Distributed Scheduling with llm-d." 2025. https://llm-d.ai/blog/kvcache-wins-you-can-see
9. BentoML. "KV cache offloading." LLM Inference Handbook. 2025. https://bentoml.com/llm/inference-optimization/kv-cache-offloading
10. GitHub. "zilliztech/GPTCache: Semantic cache for LLMs." 2024. https://github.com/zilliztech/GPTCache
11. ACL Anthology. "GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings." 2023. https://aclanthology.org/2023.nlposs-1.24.pdf
12. arXiv. "Adaptive Semantic Prompt Caching with VectorQ." 2025. https://arxiv.org/html/2502.03771v1
SEO Elements
Squarespace Excerpt (158 characters)
Prompt caching reduces LLM costs by 50-90%. Complete guide to Anthropic, OpenAI, and vLLM caching plus semantic caching with GPTCache for production AI applications.
SEO Title (53 characters)
Prompt Caching: LLM Cost & Latency Reduction Guide
SEO Description (154 characters)
Master prompt caching for LLM cost reduction. Learn provider caching (Anthropic, OpenAI), vLLM prefix caching, and semantic caching with GPTCache for production.
Title Review
Current title "Prompt Caching Infrastructure: Reducing LLM Costs and Latency" works at 57 characters. Alternatives:
- "LLM Prompt Caching: 90% Cost Reduction Guide 2025" (46 chars)
- "Prompt Caching: Semantic & Prefix Caching for LLMs" (48 chars)
URL Slug Recommendations
Primary: prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
Alternative 1: llm-prompt-caching-anthropic-openai-vllm-guide
Alternative 2: semantic-caching-gptcache-llm-cost-optimization
Alternative 3: prefix-caching-kv-cache-production-llm-guide