Inference Unit Economics: The True Cost Per Million Tokens

Updated December 8, 2025

December 2025 Update: LLM inference costs have declined 10x annually—faster than PC compute or dotcom bandwidth. GPT-4 equivalent performance now costs $0.40/million tokens versus $20 in late 2022. Cloud H100 prices stabilized at $2.85-$3.50/hour after a 64-75% decline from peaks. DeepSeek disrupted the market with pricing 90% below incumbents. Self-hosted breakeven requires 50%+ GPU utilization for 7B models, 10%+ for 13B models. Quantization reduces operational costs 60-70%, and speculative decoding cuts latency 2-3x.

The LLM inference market defies conventional technology economics. Prices declined faster than PC compute during the microprocessor revolution or bandwidth during the dotcom boom—equivalent performance costs 10x less every year.¹ A capability that cost $20 per million tokens in late 2022 now costs $0.40.² Yet organizations still struggle to understand their true inference costs because token-level pricing obscures infrastructure realities, GPU utilization determines actual unit economics, and optimization techniques create order-of-magnitude variations in cost efficiency. Mastering inference economics determines whether AI deployments generate value or hemorrhage capital.

The inference pricing landscape in December 2025

API pricing spans three orders of magnitude depending on model capability, provider, and optimization. Understanding the current landscape provides context for economic decision-making.

Budget tier models now cost fractions of a cent per million tokens. Google's Gemini Flash-Lite leads at $0.075 per million input tokens and $0.30 per million output tokens.³ Open-source models through providers like Together.ai or Hyperbolic reach even lower—Llama 3.2 3B runs at $0.06 per million tokens, achieving MMLU scores of 42 at 1/1000th the cost of three years ago.⁴

Mid-tier production models balance capability against cost. Claude Sonnet 4 prices at $3 per million input tokens and $15 per million output tokens.⁵ DeepSeek's R1 model disrupted the market at $0.55 input and $2.19 output per million tokens—90% below Western competitors for comparable reasoning capability.⁶ Chinese providers consistently undercut Western incumbents, introducing price pressure that benefits all buyers.

Frontier capability models command premium pricing. Claude Opus 4 costs $15 per million input tokens and $75 per million output tokens.⁷ GPT-4 and similar frontier models price similarly, justified by capabilities that smaller models cannot replicate regardless of cost optimization.

Provider variation adds complexity. For identical models, prices range 10x between cheapest and most expensive providers.⁸ One model might cost $0.90 per million tokens from the cheapest provider, $3.50 at median, and $9.50 from the most expensive. Shopping across providers significantly impacts economics before any technical optimization begins.

Output token pricing asymmetry reflects actual costs. OpenAI, Anthropic, and Google price output tokens 3-5x higher than input tokens because output generation requires sequential processing while input processing parallelizes efficiently.⁹ Applications generating long outputs face different economics than those processing long inputs with brief responses.
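To see how the asymmetry plays out, here is a rough comparison of two hypothetical request shapes at the Claude Sonnet 4 list prices quoted above (the request sizes are illustrative assumptions):

```python
# Rough cost of a single request at $3/M input and $15/M output tokens
# (the Claude Sonnet 4 list prices above). Request shapes are illustrative.
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float = 3.0,
                 output_price_per_m: float = 15.0) -> float:
    """Return the USD cost of one request."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

# Summarization shape: long input, short output -- input dominates
print(f"${request_cost(20_000, 500):.4f}")    # ~$0.0675
# Generation shape: short input, long output -- output dominates
print(f"${request_cost(500, 20_000):.4f}")    # ~$0.3015
```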

Understanding true GPU infrastructure costs

Behind API pricing lies GPU infrastructure with its own cost structure. Understanding these economics enables informed build-versus-buy decisions.

Hardware acquisition costs start high and continue accumulating. NVIDIA H100 GPUs cost $25,000-$40,000 per card, with complete 8-GPU server systems reaching $200,000-$400,000 including infrastructure.¹⁰ NVIDIA's manufacturing cost runs approximately $3,320 per H100—the gap between production cost and sale price reflects demand-driven margins that have only recently begun moderating.

Cloud GPU rental rates have stabilized after dramatic declines. H100 SXM instances range from $1.49/hour (Hyperbolic) to $6.98/hour (Azure), with most providers clustering around $2.85-$3.50/hour after 64-75% declines from peak prices.¹¹ Reserved capacity reduces rates further—Lambda Labs offers $1.85/hour and Hyperstack starts at $1.90/hour with commitments.

Power and cooling costs compound hardware expenses. Each H100 consumes up to 700W under load. Multi-GPU clusters require dedicated power distribution units potentially costing $10,000-$50,000 for facility upgrades.¹² Liquid cooling infrastructure or enhanced HVAC systems add $15,000-$100,000 depending on scale. These costs amortize across GPU hours but significantly impact total ownership economics.

Operational overhead bridges the gap between hardware rental and actual cost. Factoring cooling, facilities, and maintenance adds approximately $2-7 per hour to raw GPU rental rates, bringing true 8×H100 operational cost to $8-$15/hour when properly amortized.¹³ Organizations comparing cloud rental to API pricing must include these hidden costs to make valid comparisons.
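A back-of-the-envelope version of that arithmetic (every figure below is an illustrative assumption, not a quote):

```python
# All-in hourly cost for a hypothetical 8xH100 node: amortized hardware or
# reserved rental plus power, cooling, facilities, and maintenance overhead.
# Every figure is an illustrative assumption.
per_gpu_hour = 1.00          # amortized hardware or discounted reserved rate, per GPU
gpu_count = 8
overhead_per_hour = 4.00     # power, cooling, facilities, on-call (the $2-7 range above)

true_hourly_cost = per_gpu_hour * gpu_count + overhead_per_hour
print(f"${true_hourly_cost:.2f}/hour")   # $12.00/hour, inside the $8-15 range cited
```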

The utilization equation that determines viability

GPU utilization determines whether self-hosted inference makes economic sense. Paying for a GPU running at 10% load transforms $0.013 per thousand tokens into $0.13—more expensive than premium APIs.¹⁴

Breakeven analysis depends on model size and utilization targets. Hosting a 7B model requires approximately 50% utilization to cost less than GPT-3.5 Turbo.¹⁵ A 13B model achieves cost parity with GPT-4-turbo at only 10% utilization because the larger model's capability premium justifies higher infrastructure investment. The critical insight: larger models break even at lower utilization because they replace more expensive API alternatives.
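A minimal sketch of that breakeven logic, with hypothetical throughput and API prices (the numbers are assumptions for illustration, not benchmarks):

```python
# Breakeven utilization: the fraction of full load at which self-hosted
# cost per token equals the API price it would replace. Inputs are illustrative.
def breakeven_utilization(gpu_hourly_cost: float,
                          tokens_per_hour_at_full_load: float,
                          api_price_per_m_tokens: float) -> float:
    cost_per_m_at_full_load = gpu_hourly_cost / (tokens_per_hour_at_full_load / 1e6)
    return cost_per_m_at_full_load / api_price_per_m_tokens

# Hypothetical 7B deployment: $3/hour GPU, 10M tokens/hour at full load,
# displacing an API priced at $1.00/M tokens -> ~30% utilization needed
print(breakeven_utilization(3.0, 10_000_000, 1.0))   # 0.3
# Hypothetical 13B deployment displacing a $10/M-token API at half the throughput
print(breakeven_utilization(3.0, 5_000_000, 10.0))   # 0.06 -> ~6% utilization
```

Because the larger model displaces a far more expensive API, it clears breakeven at much lower utilization, which is exactly the pattern the analysis above describes.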

Traffic patterns determine achievable utilization. Organizations with consistent, predictable workloads achieve higher utilization than those with sporadic demand. Consumer-facing applications with daily traffic cycles waste GPU capacity during off-peak hours unless workloads can be shifted or infrastructure scaled dynamically.

Request volume thresholds establish minimum viable scale. Analysis suggests organizations need more than 8,000 conversations per day before self-hosted infrastructure costs less than managed solutions.¹⁶ Below this threshold, the operational complexity and fixed costs of self-hosting outweigh potential savings.

Batch processing opportunities improve utilization economics. Organizations with deferrable workloads—offline analysis, batch embeddings, dataset processing—can aggregate demand into high-utilization windows, improving effective utilization even with variable real-time traffic. Mixing real-time and batch workloads on shared infrastructure optimizes capital efficiency.

Cost structure breakdown for production deployments

Production inference costs decompose into components that optimization can address individually.

Model loading and memory consume fixed resources regardless of traffic. A 70B parameter model in FP16 requires approximately 140GB of GPU memory—exceeding single-GPU capacity and mandating multi-GPU configurations.¹⁷ Memory costs scale with model size, not usage, creating minimum infrastructure thresholds regardless of traffic volume.
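The memory arithmetic behind that threshold is simple; this sketch ignores activations, KV cache, and framework overhead:

```python
# Approximate weight memory: parameter count x bytes per parameter.
# Ignores activations, KV cache, and framework overhead.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(70, 2.0))   # FP16: ~140 GB -> spans multiple 80GB GPUs
print(weight_memory_gb(70, 1.0))   # INT8: ~70 GB
print(weight_memory_gb(70, 0.5))   # 4-bit: ~35 GB -> fits on a single 80GB GPU
```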

Compute per token drives marginal costs during inference. Forward pass computation scales with model architecture—attention mechanisms particularly for long contexts. Compute costs decline with batching because matrix operations become more efficient at larger batch sizes, amortizing overhead across more tokens.

KV cache memory grows with context length and concurrent requests. Each active request maintains key-value caches that consume memory proportional to context length. Long-context applications face memory pressure that limits concurrent requests, degrading throughput and increasing per-token costs. KV cache management represents a primary optimization target.
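A rough estimate makes the scaling concrete (the architecture values below approximate a 70B-class model with grouped-query attention and are assumptions for illustration):

```python
# Per-request KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes
# x context length. Architecture values are illustrative, roughly matching
# a 70B-class model with grouped-query attention.
def kv_cache_gb(context_len: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_len / 1e9

print(kv_cache_gb(4_096))     # ~1.3 GB per request
print(kv_cache_gb(128_000))   # ~42 GB per request: long contexts dominate memory
```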

Network and storage I/O impact multi-GPU and distributed deployments. Inter-GPU communication for tensor parallelism, loading model weights from storage, and transmitting results all consume resources. High-bandwidth networking (NVLink, InfiniBand) reduces I/O bottlenecks but increases infrastructure investment.

Operational overhead includes monitoring, logging, security, and management. Production systems require observability infrastructure, on-call personnel, and ongoing optimization effort. Organizations often underestimate these "soft" costs when comparing self-hosted against API alternatives.

Optimization techniques that transform economics

Technical optimizations can reduce inference costs by 60-70% or more, transforming marginal economics into sustainable advantages.¹⁸

Quantization reduces the precision of model weights from 32-bit floating point to 8-bit or 4-bit representations. The technique shrinks model size by 4-8x while maintaining acceptable accuracy.¹⁹ 8-bit quantization reduces memory usage 50% with approximately 1% accuracy loss. 4-bit quantization achieves 75% size reduction while maintaining competitive performance for many applications. Blackwell GPUs' FP4 support enables 4x performance gains from quantization alone.
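One common route is weight-only quantization at load time; the sketch below uses Hugging Face Transformers with bitsandbytes (the model id is a placeholder, and accuracy impact should be verified on your own evaluation set):

```python
# Sketch: load a model with 4-bit weight quantization via Transformers +
# bitsandbytes. The model id is a placeholder; validate accuracy per workload.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~75% weight-memory reduction vs FP16
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in higher precision for stability
)

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```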

Continuous batching groups requests dynamically rather than waiting for fixed batch completion. Traditional batching waits for the longest sequence to finish before processing new requests. Continuous batching evicts completed sequences immediately and begins new requests while others remain in flight.²⁰ The technique dramatically improves GPU utilization for workloads with variable sequence lengths—exactly the pattern most production deployments exhibit.
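Serving frameworks such as vLLM implement continuous batching by default; a minimal sketch (the model id is again a placeholder):

```python
# Minimal vLLM sketch: the engine batches requests continuously, admitting
# new sequences as others complete. Model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the economics of LLM inference in one sentence.",
    "List three ways to reduce GPU memory during inference.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```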

Speculative decoding uses a small "draft" model to predict multiple tokens that a larger "verification" model checks in parallel.²¹ When predictions prove correct, multiple tokens generate per forward pass rather than the standard single token. The technique reduces latency 2-3x for applications where a small model can accurately predict the larger model's outputs—particularly effective for constrained domains or structured outputs.
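In schematic form, the draft-and-verify loop looks like this (a conceptual sketch only; the draft and target callables are hypothetical, and production implementations verify all proposed tokens in a single batched forward pass with proper acceptance sampling):

```python
# Conceptual sketch of speculative decoding: a cheap draft model proposes k
# tokens, the target model checks them, and agreed-upon tokens are committed.
# draft_propose and target_verify are hypothetical helper functions.
def speculative_step(draft_propose, target_verify, context, k=4):
    proposed = draft_propose(context, num_tokens=k)   # cheap, sequential drafting
    verified = target_verify(context, proposed)       # one parallel verification pass
    accepted = []
    for draft_tok, target_tok in zip(proposed, verified):
        if draft_tok == target_tok:
            accepted.append(draft_tok)    # target agrees: keep the drafted token
        else:
            accepted.append(target_tok)   # disagreement: take the target's token, stop
            break
    return context + accepted             # several tokens per target forward pass
```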

KV cache optimization including PagedAttention manages cache memory like virtual memory, reducing fragmentation and enabling higher concurrency.²² Cache compression techniques reduce memory footprint further. Prefix caching avoids recomputation when requests share common prefixes—valuable for applications with structured prompts or system instructions.

Model distillation creates smaller models that approximate larger model behavior for specific domains. A distilled 7B model matching GPT-4 performance on targeted tasks runs at a fraction of the infrastructure cost while maintaining application-relevant quality.²³ Distillation requires upfront investment in training but produces ongoing inference savings.

Combined, these techniques compound. An organization applying quantization (4x), continuous batching (2x), and speculative decoding (2x) might achieve 16x effective cost reduction compared to naive deployment—transforming economics that seemed marginal into substantial advantages.

API versus self-hosted decision framework

The build-versus-buy decision depends on factors beyond simple cost comparison.

Choose API inference when:
- Traffic is sporadic or unpredictable
- Volume is below 8,000 conversations per day
- Engineering capacity is limited
- Rapid iteration on model selection is valuable
- Compliance requirements are satisfied by provider certifications
- Latency requirements match provider SLAs

Choose self-hosted when:
- Traffic is consistent and high-volume
- GPU utilization can exceed 50% sustainably
- Data sovereignty prevents cloud API usage
- Custom models require specialized serving
- Latency requirements exceed provider capabilities
- Cost optimization justifies engineering investment

Hybrid approaches often prove optimal. Organizations route baseline traffic to self-hosted infrastructure achieving high utilization, then overflow to APIs during demand spikes. Alternatively, sensitive workloads run self-hosted while general applications use APIs. The hybrid model captures benefits of both approaches while mitigating their respective weaknesses.
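A simplified sketch of the overflow pattern (the queue threshold and generate callables are hypothetical):

```python
# Sketch of overflow routing: keep self-hosted GPUs saturated with baseline
# traffic and spill to a managed API when the local queue backs up.
# The threshold and the generate callables are hypothetical.
QUEUE_DEPTH_LIMIT = 32

def route_request(prompt: str, queue_depth: int,
                  self_hosted_generate, api_generate) -> str:
    if queue_depth < QUEUE_DEPTH_LIMIT:
        return self_hosted_generate(prompt)   # high-utilization baseline capacity
    return api_generate(prompt)               # API absorbs demand spikes
```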

Hidden API advantages deserve consideration. Providers invest substantially in optimization—their inference costs often undercut what individual organizations can achieve because they operate at scale with dedicated optimization teams.²⁴ Some API pricing reflects VC subsidies, meaning current rates may not persist. However, the operational simplicity of API consumption has genuine value that cost comparisons alone cannot capture.

Hidden self-hosting costs frequently surprise organizations. ML engineering for optimization (quantization, sharding, inference containers), infrastructure management, monitoring and observability, and ongoing performance tuning require specialized expertise.²⁵ Building production inference infrastructure takes substantial engineering investment beyond hardware and hosting costs.

Cloud provider economics comparison

Cloud GPU economics vary significantly across providers, with June 2025 price cuts reshaping the landscape.

Hyperscaler pricing converged after AWS's 44% reduction. AWS H100 instances dropped from approximately $7/hour to $3.90/hour in June 2025.²⁶ Azure remains highest at $6.98/hour, while Google Cloud offers competitive rates around $3.00/hour for A3-High instances with spot pricing as low as $2.25/hour.

Specialized GPU cloud providers consistently undercut hyperscalers. Hyperbolic offers H100 at $1.49/hour—the current market low.²⁷ Lambda Labs provides reserved capacity at $1.85-$1.89/hour. CoreWeave prices around $6.16/hour with InfiniBand networking but offers reserved discounts up to 60%.

Price-performance varies beyond hourly rates. Hyperscalers include ecosystem services—SageMaker, Vertex AI, Azure ML—that reduce engineering burden. Specialized providers offer raw GPU access requiring more operational sophistication. The "true cost" comparison must weigh engineering time against infrastructure savings.

Spot and preemptible instances reduce costs 60-80% for fault-tolerant workloads. Training jobs can checkpoint and resume after preemption. Batch inference tolerates interruption with proper queue management. Real-time serving requiring high availability cannot rely on spot instances regardless of cost advantages.

TPU and alternative accelerator economics

NVIDIA dominance faces increasing competition from alternative accelerators with compelling economics for specific workloads.

Google TPU migration trends demonstrate potential savings. Midjourney moved inference from NVIDIA A100/H100 to TPU v6e, reducing monthly spend from $2.1 million to under $700,000—$16.8 million in annualized savings.²⁸ Waymark reports 4x lower cost than H100, Character.AI achieved 3.8x cost improvement, and Stability AI shifted 40% of image generation inference to TPU v6 in Q3 2025.

TPU economics favor sustained, predictable workloads. Google's pricing model rewards committed usage, and TPU architecture excels at the matrix operations dominating transformer inference. Organizations with stable, high-volume inference should evaluate TPU economics rather than assuming NVIDIA as default.

Custom silicon from cloud providers offers additional alternatives. AWS Inferentia and Trainium, Azure Maia, and Google's continued TPU investment signal that hyperscalers see custom accelerators as economically superior to buying NVIDIA at scale. Organizations locked into hyperscaler ecosystems should explore proprietary accelerators matching their cloud commitment.

Projecting future inference economics

The inference cost trajectory enables strategic planning for AI investments.

Price decline continues but at varying rates depending on capability tier. Achieving baseline performance costs 40-900x less year-over-year depending on the specific benchmark.²⁹ Frontier capabilities decline slower because they represent current limits rather than commoditized functionality.

Hardware improvements compound software optimization gains. Blackwell GPUs provide at least 4x performance through FP4 quantization support.³⁰ Each GPU generation improves price-performance, meaning organizations refreshing infrastructure periodically capture hardware improvements automatically.

Competitive pressure intensifies as Chinese providers, open-source alternatives, and specialized cloud providers compete for inference workloads. DeepSeek's 90% price cut demonstrated that dramatic undercutting remains possible. Organizations should structure contracts and architectures expecting continued price competition.

The commoditization thesis suggests inference becomes utility-like over time. If inference costs approach commodity levels, the primary competition shifts to capability differentiation rather than cost. Organizations investing heavily in inference cost optimization today may find those investments less valuable as the market commoditizes.

Building inference cost intelligence

Organizations should develop systematic capabilities for understanding and optimizing inference economics.

Implement token-level cost tracking that attributes inference spending to applications, users, and use cases. Many organizations know total API spend but cannot determine which applications drive costs or identify optimization targets. Granular attribution enables data-driven optimization.
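A minimal attribution sketch (prices and application names are illustrative placeholders):

```python
# Minimal per-application cost attribution: accumulate token counts and spend
# keyed by application. Prices and application names are placeholders.
from collections import defaultdict

PRICE_PER_TOKEN = {"input": 3.0 / 1e6, "output": 15.0 / 1e6}
spend_by_app: dict[str, float] = defaultdict(float)

def record_usage(app: str, input_tokens: int, output_tokens: int) -> None:
    spend_by_app[app] += (input_tokens * PRICE_PER_TOKEN["input"]
                          + output_tokens * PRICE_PER_TOKEN["output"])

record_usage("support-bot", 12_000, 800)
record_usage("doc-summarizer", 90_000, 2_000)
print(dict(spend_by_app))   # reveals which applications actually drive spend
```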

Benchmark across providers regularly rather than assuming current provider remains optimal. The 10x price variation across providers for identical models means switching can dramatically impact economics without any technical optimization.³¹ Quarterly benchmarking ensures organizations capture price improvements.

Model optimization investment yields compounding returns. A quantization implementation that reduces costs 50% generates savings continuously until models change. Optimization engineering should be evaluated against ongoing savings rather than one-time project costs.

Stay current on techniques as inference optimization evolves rapidly. Speculative decoding, continuous batching, KV cache optimization, and other techniques emerged recently and continue improving. Organizations lacking inference expertise should consider partnerships—Introl's infrastructure deployment experience across our global coverage area includes optimizing inference deployments for cost efficiency.

The inference economics landscape rewards organizations that understand true costs, optimize systematically, and structure infrastructure for the trajectory of declining prices. Those treating inference as a static cost center will find themselves subsidizing competitors who treat inference optimization as a strategic capability. In a market where costs decline 10x annually, standing still means falling behind.

References

  1. Andreessen Horowitz. "Welcome to LLMflation - LLM inference cost is going down fast." a16z, 2024. https://a16z.com/llmflation-llm-inference-cost/

  2. Epoch AI. "LLM inference prices have fallen rapidly but unequally across tasks." Epoch AI Data Insights, 2024. https://epoch.ai/data-insights/llm-inference-price-trends

  3. IntuitionLabs. "LLM API Pricing Comparison (2025): OpenAI, Gemini, Claude." IntuitionLabs Articles, 2025. https://intuitionlabs.ai/articles/llm-api-pricing-comparison-2025

  4. MIRI Technical Governance Team. "Observations About LLM Inference Pricing." MIRI TGT Blog, 2024. https://techgov.intelligence.org/blog/observations-about-llm-inference-pricing

  5. IntuitionLabs. "LLM API Pricing Comparison (2025): OpenAI, Gemini, Claude." IntuitionLabs Articles, 2025.

  6. ———. "LLM API Pricing Comparison (2025): OpenAI, Gemini, Claude." IntuitionLabs Articles, 2025.

  7. ———. "LLM API Pricing Comparison (2025): OpenAI, Gemini, Claude." IntuitionLabs Articles, 2025.

  8. LessWrong. "Observations About LLM Inference Pricing." LessWrong, 2024. https://www.lesswrong.com/posts/mRKd4ArA5fYhd2BPb/observations-about-llm-inference-pricing

  9. ———. "Observations About LLM Inference Pricing." LessWrong, 2024.

  10. JarvisLabs. "NVIDIA H100 Price Guide 2025: Detailed Costs, Comparisons & Expert Insights." JarvisLabs Documentation, 2025. https://docs.jarvislabs.ai/blog/h100-price

  11. Hyperbolic. "GPU Cloud Pricing: 2025 Guide to Costs, Models & Optimization." Hyperbolic Blog, 2025. https://www.hyperbolic.ai/blog/gpu-cloud-pricing

  12. GMI Cloud. "How Much Does the NVIDIA H100 GPU Cost in 2025? Buy vs. Rent." GMI Cloud Blog, 2025. https://www.gmicloud.ai/blog/how-much-does-the-nvidia-h100-gpu-cost-in-2025-buy-vs-rent-analysis

  13. Castonguay, Danny. "Cost of Inference." Danny Castonguay Blog, 2024. https://blog.dannycastonguay.com/Cost-of-Inference/

  14. ———. "Cost of Inference." Danny Castonguay Blog, 2024.

  15. Medium. "LLMs deployment: a practical cost analysis." Artefact Engineering and Data Science, 2024. https://medium.com/artefact-engineering-and-data-science/llms-deployment-a-practical-cost-analysis-e0c1b8eb08ca

  16. ———. "LLMs deployment: a practical cost analysis." Artefact Engineering and Data Science, 2024.

  17. NVIDIA. "Mastering LLM Techniques: Inference Optimization." NVIDIA Technical Blog, 2023. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/

  18. Clarifai. "LLM Inference Optimization Techniques." Clarifai Guide, 2024. https://www.clarifai.com/blog/llm-inference-optimization/

  19. Hugging Face. "Optimizing inference." Transformers Documentation, 2024. https://huggingface.co/docs/transformers/main/en/llm_optims

  20. NVIDIA. "Mastering LLM Techniques: Inference Optimization." NVIDIA Technical Blog, 2023.

  21. Adaline. "What is LLM Inference Optimization: Techniques and Implementation Guide." Adaline Blog, 2024. https://www.adaline.ai/blog/what-is-llm-inference-optimization

  22. Deepsense.ai. "LLM Inference Optimization: How to Speed Up, Cut Costs, and Scale AI Models." Deepsense.ai Blog, 2024. https://deepsense.ai/blog/llm-inference-optimization-how-to-speed-up-cut-costs-and-scale-ai-models/

  23. ArXiv. "Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI's LLM with Open Source SLMs in Production." ArXiv, 2024. https://arxiv.org/html/2312.14972v3

  24. Snellman. "LLMs are cheap." Snellman Blog, June 2025. https://www.snellman.net/blog/archive/2025-06-02-llms-are-cheap/

  25. Medium. "The LLM Deployment Dilemma: Third-Party APIs or Self-Hosting?" Medium, 2024. https://medium.com/@michael_england/the-llm-deployment-dilemma-third-party-apis-or-self-hosting-5d18788df444

  26. Cast.AI. "2025 GPU Price Report – A100 & H100 Cost." Cast.AI Reports, 2025. https://cast.ai/reports/gpu-price/

  27. Hyperbolic. "GPU Cloud Pricing: 2025 Guide to Costs, Models & Optimization." Hyperbolic Blog, 2025.

  28. AI News Hub. "Nvidia to Google TPU Migration 2025: The $6.32B Inference Cost Crisis." AI News Hub, 2025. https://www.ainewshub.org/post/nvidia-vs-google-tpu-2025-cost-comparison

  29. Epoch AI. "LLM inference prices have fallen rapidly but unequally across tasks." Epoch AI Data Insights, 2024.

  30. NVIDIA. "LLM Inference Benchmarking: How Much Does Your LLM Inference Cost?" NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/llm-inference-benchmarking-how-much-does-your-llm-inference-cost/

  31. LessWrong. "Observations About LLM Inference Pricing." LessWrong, 2024.


Squarespace Excerpt (159 characters)

LLM inference costs fell 10x annually—GPT-4 performance now costs $0.40/million tokens vs $20 in 2022. Master the economics driving AI deployment decisions.

SEO Title (52 characters)

Inference Economics: True Cost Per Million Tokens

SEO Description (155 characters)

Calculate true LLM inference costs across APIs, cloud GPUs, and self-hosted. Compare providers, optimize with quantization and batching, project 2025 trends.

Title Review

Current title "Inference Unit Economics: The True Cost Per Million Tokens" at 58 characters fits within optimal SERP display limits and effectively targets high-value search terms.

URL Slug Recommendations

Primary: inference-unit-economics-true-cost-per-million-tokens-guide

Alternatives:
1. llm-inference-cost-analysis-self-hosted-vs-api
2. gpu-inference-economics-optimization-2025
3. token-cost-breakdown-llm-deployment-pricing

Key takeaways

For finance teams:
- GPT-4 equivalent performance: $20/M tokens (late 2022) → $0.40/M tokens now—10x annual decline
- H100 cloud pricing stabilized at $2.85-$3.50/hour after 64-75% decline from peaks
- DeepSeek R1: $0.55 input/$2.19 output per M tokens—90% below Western competitors

For infrastructure architects:
- Self-hosted breakeven: 50%+ utilization for 7B models, 10%+ for 13B models
- True 8×H100 operational cost: $8-15/hour including cooling, facilities, maintenance
- Minimum viable scale: 8,000+ conversations/day before self-hosted beats managed solutions

For ML engineers:
- Quantization (8-bit) reduces memory 50% with ~1% accuracy loss; 4-bit achieves 75% reduction
- Combined optimizations (quantization + continuous batching + speculative decoding) yield 16x cost reduction
- Output tokens priced 3-5x higher than input due to sequential vs parallel processing

For cloud strategy:
- Midjourney migrated to TPU v6e: $2.1M → $700K monthly ($16.8M annualized savings)
- Hyperbolic offers H100 at $1.49/hour (market low); AWS dropped to $3.90/hour (44% cut)
- Provider variation for identical models: 10x price range (cheapest to most expensive)
