Voice AI Infrastructure: Building Real-Time Speech Agents
Updated December 11, 2025
December 2025 Update: Deepgram STT delivers transcription in 150ms and ElevenLabs TTS speaks in 75ms, yet most agents still take 800ms-2s to respond because latency compounds across the stack. Human conversation expects a 300-500ms response window. Pipeline latency: STT (100-500ms) + LLM (350ms-1s+) + TTS (75-200ms). Every millisecond matters for production voice agents.
Deepgram delivers speech-to-text in 150 milliseconds. ElevenLabs synthesizes voice in 75 milliseconds. Yet most voice AI agents still take 800 milliseconds to two seconds to respond—because latency compounds across the entire stack.¹ The gap between component capabilities and end-to-end performance reveals the infrastructure challenge at the heart of voice AI: orchestrating speech recognition, language models, and synthesis into pipelines that match human conversational timing.
Human conversation operates within a 300-500 millisecond response window.² Delays beyond 500 milliseconds feel unnatural. Beyond 1.2 seconds, users hang up or interrupt. Building voice agents that meet these thresholds requires understanding each layer of the stack, selecting appropriate components, and architecting systems where every millisecond counts.
The voice AI stack
Every voice agent relies on four components working in concert:³
Speech-to-Text (STT/ASR): The "ears" that transcribe spoken audio into text. Latency ranges from 100-500 milliseconds depending on streaming configuration.
Large Language Model (LLM): The "brain" that processes transcribed text and generates responses. Latency ranges from 350 milliseconds for optimized models to over one second for frontier models.
Text-to-Speech (TTS): The "voice" that synthesizes response text into audio. Modern streaming TTS achieves 75-200 millisecond time-to-first-audio.
Orchestration: The "conductor" managing real-time flow between components, handling turn-taking, interruptions, and session state.
The latency equation
Voice AI latency accumulates across the pipeline:⁴
Total Latency = STT + LLM + TTS + Network + Processing
= 200ms + 500ms + 150ms + 50ms + 100ms
= 1000ms (typical)
Achieving sub-500 millisecond responses requires either compressing each component or parallelizing the pipeline through streaming—starting speech synthesis before the LLM finishes generating, processing partial transcriptions before users finish speaking.
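As a back-of-envelope illustration of why streaming matters, the sketch below contrasts the sequential figures from the equation above with an overlapped pipeline in which the LLM starts on partial transcripts and TTS begins as soon as the first usable chunk of the response exists. The overlapped numbers are illustrative assumptions, not measurements.

```python
# Illustrative latency math: sequential vs. streamed pipeline (example values only).
stt, llm, tts, network, processing = 200, 500, 150, 50, 100  # milliseconds

sequential = stt + llm + tts + network + processing
print(f"Sequential pipeline: {sequential} ms")  # 1000 ms

# With streaming, perceived latency is roughly time-to-first-audio: the final
# transcript lands shortly after the user stops speaking, the LLM only needs to
# emit its first sentence, and TTS starts on that sentence immediately.
stt_final = 150           # endpoint + final transcript after end of speech (assumed)
llm_first_sentence = 300  # time to first usable chunk of the response (assumed)
tts_first_audio = 100     # time-to-first-audio for that chunk (assumed)
streamed = stt_final + llm_first_sentence + tts_first_audio + network
print(f"Streamed pipeline (time to first audio): {streamed} ms")  # ~600 ms
```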
Speech-to-text infrastructure
The ASR layer converts audio streams into text that language models can process. Provider selection involves balancing latency, accuracy, and cost.
Provider comparison
Deepgram Nova-3:⁵
- Time-to-first-token: ~150ms (US), 250-350ms (global)
- Word error rate: 18.3%
- Streaming optimized with real-time factor 0.2-0.3x
- Pricing: $0.0043/minute (pay-as-you-go)
- Best for: Low-latency voice agents prioritizing speed

AssemblyAI Universal-2:⁶
- Latency: 300-600ms
- Word error rate: 14.5% (best accuracy among streaming models)
- Strong domain-specific performance in medical and sales contexts
- Pricing: $0.00025/second
- Best for: Applications requiring accuracy over raw speed

Whisper (self-hosted):⁷
- Latency: 1-5 seconds (batch), 380-520ms (WhisperX optimized)
- Highest accuracy for offline transcription
- Requires significant engineering for production streaming
- Best for: Batch processing, hybrid architectures

Groq-accelerated Whisper:
- Latency: Sub-300ms on LPU hardware
- Combines Whisper accuracy with streaming latency
- Limited availability through GroqCloud
- Best for: Quality-focused real-time applications
ASR infrastructure patterns
Streaming architecture: Begin transcription immediately as audio arrives rather than waiting for complete utterances. Partial results feed downstream components before users finish speaking.
```python
# Streaming ASR pattern
async def transcribe_stream(audio_stream):
    async for chunk in audio_stream:
        partial = await asr_client.transcribe_chunk(chunk)
        if partial.is_final:
            yield partial.text
        else:
            # Send interim results for prediction
            yield partial.interim_text
```
Voice Activity Detection (VAD): Detect when users start and stop speaking. Poor VAD creates either premature cutoffs (interrupting users) or excessive delays (waiting for silence that already occurred).
Endpointing: Determine when a user has finished their turn. Aggressive endpointing reduces latency but risks cutting off speakers. Conservative endpointing ensures completeness but adds delay.
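A minimal sketch of silence-based endpointing, assuming 20ms audio frames and a `vad_is_speech()` helper (for example, a wrapper around an energy threshold or a VAD model); both names are placeholders. The aggressive-versus-conservative trade-off described above lives entirely in the `silence_ms` threshold.

```python
# Minimal silence-based endpointing sketch (hypothetical vad_is_speech helper).
# A lower silence_ms cuts latency but risks clipping the speaker;
# a higher value waits longer for certainty.
FRAME_MS = 20

def detect_end_of_turn(frames, vad_is_speech, silence_ms=400):
    """Return True once `silence_ms` of continuous silence follows speech."""
    silence_run = 0
    heard_speech = False
    for frame in frames:
        if vad_is_speech(frame):
            heard_speech = True
            silence_run = 0
        elif heard_speech:
            silence_run += FRAME_MS
            if silence_run >= silence_ms:
                return True  # user has likely finished their turn
    return False
```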
GPU requirements for self-hosted ASR
Self-hosted Whisper deployments require GPU acceleration:⁸
| Workload Level | GPU | Concurrent Streams |
|---|---|---|
| Development | RTX 3060/4060 | 5-10 |
| Production | A100 40GB | 50-100 |
| Enterprise | H100 | 200+ |
Production speech-to-text typically runs on A100 or RTX 6000 Ada rather than H100—the workload benefits more from memory bandwidth than raw compute.
Large language model layer
The LLM processes transcribed speech and generates response text. Model selection dramatically affects both latency and conversation quality.
Model latency profiles
Ultra-fast (sub-350ms):⁹
- Gemini 1.5 Flash: ~300ms time-to-first-token
- Groq-served Llama: ~200ms on LPU
- Best for: Maximum responsiveness, simpler queries

Fast (350-700ms):
- GPT-4o-mini: ~400ms
- Claude 3.5 Haiku: ~350ms
- Best for: Balanced speed and capability

Standard (700ms-1s+):
- GPT-4o: ~700ms
- Claude 3.5 Sonnet: ~800ms
- Best for: Complex reasoning, quality-critical applications
Optimization strategies
Streaming generation: Begin TTS synthesis as LLM tokens arrive rather than waiting for complete responses. Modern orchestration pipelines stream tokens directly to speech synthesis.
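A sketch of that pattern, assuming an async token iterator from the LLM and a `tts.synthesize_stream()` call that accepts text fragments; the real interfaces vary by provider, so treat both names as placeholders.

```python
import re

# Stream LLM tokens to TTS at clause boundaries instead of waiting for the
# full response. `llm_tokens` and `tts` are placeholder interfaces.
PAUSE_POINT = re.compile(r"[.!?,;:]\s*$")

async def speak_as_generated(llm_tokens, tts):
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        # Flush whenever we hit a natural pause point and have enough text.
        if len(buffer) > 20 and PAUSE_POINT.search(buffer):
            async for audio_chunk in tts.synthesize_stream(buffer):
                yield audio_chunk
            buffer = ""
    if buffer:  # flush whatever remains at the end of the response
        async for audio_chunk in tts.synthesize_stream(buffer):
            yield audio_chunk
```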
Speculative execution: Predict likely responses based on partial transcriptions. Start generating responses before users finish speaking, discarding predictions that don't match final intent.
Model routing: Route simple queries to fast models, complex queries to capable models. A classifier determines query complexity in single-digit milliseconds.
```python
# Model routing pattern
def route_query(transcript, context):
    complexity = classify_complexity(transcript)
    if complexity == "simple":
        return "gemini-flash"
    elif complexity == "moderate":
        return "gpt-4o-mini"
    else:
        return "gpt-4o"
```
Prompt optimization: Shorter prompts reduce processing time. Cache system prompts where providers support prompt caching (Anthropic achieves 90% cost reduction on cached prefixes).
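As one example, Anthropic's prompt caching marks a content block as cacheable so repeated turns reuse the processed prefix. The snippet below is a sketch based on the documented `cache_control` parameter; the prompt text and transcript are placeholders, and exact fields should be checked against current API docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a voice agent for Acme Support. ..."  # static, reused every turn (placeholder)
transcript = "I'd like to check my order status."                   # latest user turn (placeholder)

# The static system prompt is marked cacheable so repeated turns reuse the
# processed prefix instead of paying to process it on every request.
response = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": transcript}],
)
print(response.content[0].text)
```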
Text-to-speech infrastructure
TTS converts LLM-generated text into natural-sounding speech. The layer has transformed from a bottleneck (2-3 seconds historically) to a strength (75-150ms with modern providers).
Provider comparison
ElevenLabs Flash v2.5:¹⁰
- Time-to-first-audio: 75ms
- Voice quality: Industry-leading naturalness
- Emotional range: Excellent expressiveness
- Pricing: $0.050/1,000 characters
- Best for: Quality-critical applications

Cartesia Sonic:¹¹
- Time-to-first-audio: 40-95ms
- Purpose-built for real-time conversation
- Consistent low latency under load
- Pricing: $0.038/1,000 characters
- Best for: Latency-critical applications

Deepgram Aura-2:¹²
- Time-to-first-audio: Sub-150ms
- Enterprise-grade reliability
- Cost-effective at scale
- Pricing: $0.030/1,000 characters
- Best for: High-volume enterprise deployments

PlayHT:
- Latency: ~300ms
- Extensive voice library
- Voice cloning capabilities
- Lower price point
- Best for: Budget-conscious applications
TTS infrastructure patterns
Streaming synthesis: Generate audio progressively as text arrives from the LLM. Send audio chunks to users before complete sentences finish synthesizing.
Audio buffering: Maintain small buffers to smooth playback despite variable synthesis timing. Buffer too much and latency suffers. Buffer too little and audio stutters.
Voice caching: Cache frequently used phrases (greetings, common responses) as pre-synthesized audio. Eliminates TTS latency entirely for cached content.
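A minimal sketch of phrase-level audio caching, assuming a synchronous `tts_synthesize()` helper (a placeholder for the provider call): fixed phrases are synthesized once, then replayed from memory with zero synthesis latency.

```python
import hashlib

# Pre-synthesized audio cache for fixed phrases (greetings, disclaimers).
# `tts_synthesize` is a placeholder for the provider call returning audio bytes.
_audio_cache: dict[str, bytes] = {}

def get_audio(text: str, voice: str, tts_synthesize) -> bytes:
    key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
    if key not in _audio_cache:                  # cache miss: pay TTS latency once
        _audio_cache[key] = tts_synthesize(text=text, voice=voice)
    return _audio_cache[key]                     # cache hit: no synthesis latency

def warm_cache(phrases: list[str], voice: str, tts_synthesize) -> None:
    """Synthesize common phrases at startup so no caller ever waits for them."""
    for phrase in phrases:
        get_audio(phrase, voice, tts_synthesize)
```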
Orchestration platforms
Orchestration layers connect ASR, LLM, and TTS components while handling telephony, turn-taking, and session management. Platform selection determines development velocity and production reliability.
Platform comparison
Vapi:¹³
- Focus: Turnkey voice agent platform
- Telephony: Native SIP/PSTN integration
- Customization: Modular component selection
- Pricing: $0.05/minute + component costs
- Best for: Rapid deployment, telephone-focused applications

LiveKit:¹⁴
- Focus: Open-source real-time infrastructure
- Architecture: WebRTC-native with agent framework
- Customization: Full control, self-hostable
- Pricing: Free tier (100 concurrent, 5,000 minutes/month), paid from $50/month
- Best for: Custom applications, teams needing full control

Retell AI:¹⁵
- Focus: Natural conversation flow
- Differentiator: Optimized turn-taking and interruption handling
- Compliance: HIPAA and SOC 2 Type II
- Pricing: $0.07+/minute
- Best for: Conversation quality priority, enterprise compliance

Pipecat:
- Focus: Open-source agent framework
- Integration: Works with major cloud providers
- Customization: Highly flexible pipeline construction
- Best for: Developers wanting framework without platform lock-in
Selection criteria
| Factor | Vapi | LiveKit | Retell |
|---|---|---|---|
| Telephony integration | Excellent | Good (via SIP) | Excellent |
| Customization | High | Highest | Moderate |
| Setup complexity | Low | Moderate | Low |
| Self-hosting | No | Yes | No |
| Enterprise features | Good | Growing | Excellent |
Architecture patterns
Cascading pipeline (ASR → LLM → TTS)
The traditional architecture processes audio through discrete stages:¹⁶
Audio → ASR → Text → LLM → Response Text → TTS → Audio
Advantages:
- Component modularity (swap providers easily)
- Mature tooling and debugging
- Predictable cost structure (~$0.15/minute regardless of conversation length)
- Transparent intermediate representations (text is inspectable)

Challenges:
- Latency accumulation across stages
- Information loss in text representation (prosody, emotion)
- Complex streaming coordination
Speech-to-speech (S2S)
End-to-end models process audio directly to audio:¹⁷
Audio → Multimodal Model → Audio
Examples:
- GPT-4o voice mode
- Moshi (Kyutai Labs)
- Ultravox

Advantages:
- Preserves prosodic information
- Potentially lower latency (single model)
- Handles overlapping speech naturally

Challenges:
- Higher cost (~$0.30-1.50/minute for longer conversations)
- Limited customization (cannot swap components)
- Debugging opacity (no intermediate text)
Hybrid approaches
Production systems increasingly combine architectures:
Cascading with S2S fallback: Use cascading for standard interactions, switch to S2S for complex overlapping dialogue.
Parallel processing: Run ASR and intent prediction simultaneously. Begin response generation based on predicted intent while ASR completes.
Speculative TTS: Pre-generate likely response audio. Play cached audio immediately if prediction matches; fall back to synthesis otherwise.
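A sketch of the speculative-TTS idea under simple assumptions: while the user is still speaking, a cheap predictor guesses the likely reply from the partial transcript and pre-synthesizes it in the background; the audio is used only if the final response matches, otherwise the system falls back to normal synthesis. All helpers are placeholders.

```python
import asyncio

# Speculative TTS sketch: pre-generate audio for a predicted reply and use it
# only if the final response matches. All helper callables are placeholders.
async def respond_speculatively(partial_transcript, predict_reply,
                                generate_reply, synthesize):
    predicted = predict_reply(partial_transcript)             # cheap guess
    speculative = asyncio.create_task(synthesize(predicted))  # pre-generate in background

    final_reply = await generate_reply(partial_transcript)    # authoritative answer
    if final_reply.strip() == predicted.strip():
        return await speculative      # reuse pre-generated audio, no added TTS wait
    speculative.cancel()              # prediction missed; discard it
    return await synthesize(final_reply)
```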
Scaling voice AI infrastructure
Concurrent capacity planning
Voice AI scales differently than text-based AI. Each concurrent call requires dedicated processing resources across the pipeline.¹⁸
Per-GPU capacity (self-hosted):
| GPU | ASR Streams | LLM Concurrent | TTS Streams |
|---|---|---|---|
| L4 | 50 | 20-30 | 100 |
| L40S | 100 | 50-75 | 200 |
| A100 | 100 | 75-100 | 250 |
| H100 | 200+ | 150-200 | 400+ |
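For self-hosted deployments, a rough sizing pass follows directly from the table above. The sketch below assumes the A100 column and leaves headroom so no pool runs at its nominal limit; real capacity depends on models, batch sizes, and audio characteristics.

```python
import math

# Rough GPU sizing from the per-GPU stream counts above (A100 column).
STREAMS_PER_GPU = {"asr": 100, "llm": 75, "tts": 250}

def gpus_needed(concurrent_calls: int, headroom: float = 0.7) -> dict[str, int]:
    """Size each pool so it runs at ~70% of its nominal stream capacity."""
    return {
        component: math.ceil(concurrent_calls / (capacity * headroom))
        for component, capacity in STREAMS_PER_GPU.items()
    }

print(gpus_needed(500))  # {'asr': 8, 'llm': 10, 'tts': 3}
```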
Managed service capacity: Cloud providers handle scaling automatically but enforce rate limits. Enterprise agreements unlock dedicated capacity for predictable high-volume deployments.
Geographic distribution
Voice latency includes network round-trip time. Global deployments require regional infrastructure:¹⁹
Colocation strategy: Place inference infrastructure at telephony points of presence. Telnyx colocates GPUs with telephony PoPs for ultra-low latency.
Edge deployment: Run ASR and TTS at edge locations while centralizing LLM inference. Reduces audio transport latency while maintaining model capability.
Regional failover: Maintain capacity in multiple regions for resilience. Route calls to nearest healthy region.
Microservices architecture
Production voice AI benefits from service decomposition:²⁰
```
                ┌─────────────┐
                │   Gateway   │
                └──────┬──────┘
       ┌───────────────┼───────────────┐
       │               │               │
  ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
  │   ASR   │     │   LLM   │     │   TTS   │
  │ Service │     │ Service │     │ Service │
  └────┬────┘     └────┬────┘     └────┬────┘
       │               │               │
  ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
  │   ASR   │     │   LLM   │     │   TTS   │
  │  Pods   │     │  Pods   │     │  Pods   │
  └─────────┘     └─────────┘     └─────────┘
```
Benefits:
- Independent scaling per component
- Fault isolation (ASR failure doesn't crash TTS)
- Technology flexibility (different languages per service)
- Resource optimization (right-size each service)
Cost modeling
Per-minute economics
Voice AI costs accumulate across the stack:²¹
Managed platform (typical):
ASR: $0.006/minute (Deepgram)
LLM: $0.02-0.10/minute (varies by model)
TTS: $0.02/minute (mid-tier)
Orchestration: $0.05/minute (Vapi)
Telephony: $0.01/minute
Total: $0.10-0.20/minute
Self-hosted (at scale):
GPU compute: $0.03-0.05/minute (amortized)
Infrastructure: $0.01/minute
Engineering overhead: Variable
Total: $0.05-0.10/minute (higher volume)
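A quick sketch, using the per-minute figures above, for comparing managed and self-hosted spend at a given call volume; the rates are mid-range values from this section and should be replaced with negotiated pricing.

```python
# Monthly cost comparison using the per-minute figures above (illustrative).
MANAGED_PER_MIN = 0.006 + 0.06 + 0.02 + 0.05 + 0.01  # ASR + LLM (mid-range) + TTS + orchestration + telephony
SELF_HOSTED_PER_MIN = 0.04 + 0.01                     # amortized GPU compute + infrastructure

def monthly_cost(calls_per_day: int, avg_minutes: float = 4.0) -> dict[str, float]:
    minutes = calls_per_day * avg_minutes * 30
    return {
        "minutes": minutes,
        "managed_usd": round(minutes * MANAGED_PER_MIN, 2),
        "self_hosted_usd": round(minutes * SELF_HOSTED_PER_MIN, 2),
    }

print(monthly_cost(2_000))  # ~240,000 minutes/month
```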
Cost optimization strategies
Model tiering: Route simple queries to cheaper models. Save expensive frontier models for complex conversations.
Caching: Cache common responses as audio. Eliminate ASR/LLM/TTS costs for cached interactions entirely.
Off-peak scheduling: Batch non-real-time voice processing during lower-cost compute windows.
Provider negotiation: High-volume deployments unlock significant discounts. Negotiate committed use agreements.
Organizations deploying voice AI infrastructure at scale can leverage Introl's global expertise for GPU deployment and optimization across 257 locations worldwide.
Production considerations
Latency monitoring
Track latency at each pipeline stage:
Key metrics:
- Time-to-first-byte (TTFB) per component
- End-to-end response latency (user speech end → agent speech start)
- P50, P95, P99 latency distributions
- Latency by geographic region

Alerting thresholds:
- P95 > 800ms: Warning
- P95 > 1200ms: Critical
- Component TTFB > 2x baseline: Investigate
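A small sketch of turning per-turn latency samples into the percentiles and alert states listed above; in production these numbers would come from a metrics system, but the thresholds are the ones stated here.

```python
import statistics

WARNING_P95_MS, CRITICAL_P95_MS = 800, 1200

def latency_report(samples_ms: list[float]) -> dict:
    """Compute P50/P95/P99 and map P95 onto the alerting thresholds above."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    if p95 > CRITICAL_P95_MS:
        status = "critical"
    elif p95 > WARNING_P95_MS:
        status = "warning"
    else:
        status = "ok"
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99, "status": status}
```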
Reliability patterns
Circuit breakers: Fail fast when downstream services degrade. Route to backup providers automatically.
Graceful degradation: Fall back to simpler models or cached responses when primary services fail.
Health checks: Monitor each component independently. Remove unhealthy instances from rotation.
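To make the circuit-breaker pattern above concrete, here is a minimal sketch wrapping a provider call: after a few consecutive failures the primary is skipped for a cooldown period and traffic goes to the backup. The callables are placeholders for whichever STT, LLM, or TTS client is being protected.

```python
import time

# Minimal circuit breaker for a provider call. After `max_failures` consecutive
# errors the primary is skipped for `cooldown_s` and the backup is used instead.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, backup, *args, **kwargs):
        circuit_open = (
            self.failures >= self.max_failures
            and time.monotonic() - self.opened_at < self.cooldown_s
        )
        if not circuit_open:
            try:
                result = primary(*args, **kwargs)
                self.failures = 0          # success closes the circuit
                return result
            except Exception:
                self.failures += 1
                self.opened_at = time.monotonic()
        return backup(*args, **kwargs)     # degrade to the backup provider
```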
Compliance requirements
HIPAA (healthcare): Ensure BAAs cover all pipeline components. Voice data contains PHI.
PCI-DSS (payments): Voice recordings may contain card numbers. Implement appropriate redaction.
GDPR/CCPA: Voice biometrics constitute personal data. Implement consent and deletion workflows.
The sub-500ms imperative
Voice AI infrastructure succeeds or fails on latency. Users abandon conversations that feel sluggish. They engage with agents that respond naturally. The 500-millisecond threshold separating natural from artificial interaction drives every architectural decision.
Achieving production-grade voice AI requires optimizing each pipeline component while architecting systems that parallelize processing wherever possible. Streaming ASR feeds partial transcriptions to speculative LLM processing. Streaming TTS begins speaking before responses complete. The entire system operates as a continuous flow rather than discrete stages.
The infrastructure investment pays returns in user experience. Voice agents that respond in 400 milliseconds feel like conversations. Those responding in 1.5 seconds feel like telephone menus. For organizations building voice AI products, the latency optimization effort determines whether users adopt or abandon the technology.
The stack has matured. Deepgram delivers 150ms ASR. ElevenLabs achieves 75ms TTS. Groq serves LLMs at 200ms. The components exist to build sub-500ms voice agents. Success depends on infrastructure that orchestrates these components into seamless conversational experiences.
References
1. Retell AI. "AI Voice Agent Latency Face-Off 2025." 2025. https://www.retellai.com/resources/ai-voice-agent-latency-face-off-2025
2. TringTring.AI. "Understanding Latency in AI Voice Agents: Why Sub-500ms Matters." 2025. https://tringtring.ai/blog/technical-deep-dive/understanding-latency-in-ai-voice-agents-why-sub-500ms-matters/
3. AssemblyAI. "The voice AI stack for building agents in 2025." 2025. https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents
4. Twilio. "Core Latency in AI Voice Agents." 2025. https://www.twilio.com/en-us/blog/developers/best-practices/guide-core-latency-ai-voice-agents
5. Deepgram. "Speech-to-Text API." 2025. https://deepgram.com/product/speech-to-text
6. AssemblyAI. "Top APIs and models for real-time speech recognition and transcription in 2025." 2025. https://www.assemblyai.com/blog/best-api-models-for-real-time-speech-recognition-and-transcription
7. Ionio. "2025 Edge Speech-to-Text Model Benchmark: Whisper vs. Competitors." 2025. https://www.ionio.ai/blog/2025-edge-speech-to-text-model-benchmark-whisper-vs-competitors
8. Dell Technologies. "Benchmarking NVIDIA GPU Throughput for LLMs." 2025. https://infohub.delltechnologies.com/en-us/p/benchmarking-nvidia-gpu-throughput-for-llms-and-understanding-gpu-configuration-choices-in-the-ai-space/
9. ElevenLabs. "How do you optimize latency for Conversational AI?" 2025. https://elevenlabs.io/blog/how-do-you-optimize-latency-for-conversational-ai
10. ElevenLabs. "ElevenLabs vs. Cartesia." June 2025. https://elevenlabs.io/blog/elevenlabs-vs-cartesia
11. Podcastle. "Streaming TTS benchmark: Async vs Elevenlabs vs Cartesia." 2025. https://podcastle.ai/blog/tts-latency-vs-quality-benchmark/
12. Deepgram. "Introducing Aura-2: Enterprise-Grade Text-to-Speech." 2025. https://deepgram.com/learn/introducing-aura-2-enterprise-text-to-speech
13. Softcery. "11 Voice Agent Platforms Compared." 2025. https://softcery.com/lab/choosing-the-right-voice-agent-platform-in-2025
14. Modal. "LiveKit vs Vapi: which voice AI framework is best in 2025?" 2025. https://modal.com/blog/livekit-vs-vapi-article
15. Retell AI. "Top 11 Vapi AI Alternatives for Conversational Voice Agents." 2025. https://www.retellai.com/blog/best-vapi-alternatives-for-enterprise-voice-ai
16. Brain.co. "Chained Voice Agent Architectures: Speech-to-Speech vs Chained Pipeline vs Hybrid Approaches." 2025. https://brain.co/blog/chained-voice-agent-architectures-speech-to-speech-vs-chained-pipeline-vs-hybrid-approaches
17. Softcery. "Real-Time vs Turn-Based Voice Agent Architecture." 2025. https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture
18. ZenML. "ElevenLabs: Scaling Voice AI with GPU-Accelerated Infrastructure." 2025. https://www.zenml.io/llmops-database/scaling-voice-ai-with-gpu-accelerated-infrastructure
19. Telnyx. "Voice AI Latency: Why colocation is mission critical." 2025. https://telnyx.com/resources/how-telnyx-fixed-voice-ai-latency-with-co-located-infrastructure
20. Redapt. "Empowering Scalable Voice AI with Optimized Infrastructure." 2025. https://www.redapt.com/customer-stories/empowering-scalable-voice-ai-with-optimized-infrastructure
21. Cerebrium. "Deploying a global scale, AI voice agent with 500ms latency." 2025. https://www.cerebrium.ai/blog/deploying-a-global-scale-ai-voice-agent-with-500ms-latency
Key takeaways
For infrastructure architects:
- Latency equation: STT (200ms) + LLM (500ms) + TTS (150ms) + network (50ms) + processing (100ms) = 1,000ms typical
- Sub-500ms threshold for natural conversation; users abandon at 1.2+ seconds; 300-500ms matches human response timing
- GPU capacity per device: L4 (50 ASR / 20 LLM / 100 TTS streams), A100 (100/75/250), H100 (200+/150/400+)

For provider selection:
- ASR: Deepgram Nova-3 150ms TTFT at $0.0043/min; AssemblyAI 300-600ms but 14.5% WER (best accuracy)
- TTS: ElevenLabs Flash 75ms at $0.050/1K chars (best quality); Cartesia 40-95ms at $0.038/1K (latency-critical)
- Orchestration: Vapi $0.05/min (fastest setup), LiveKit (open source, self-hostable), Retell (HIPAA/SOC 2)

For finance teams:
- Managed platform: $0.10-0.20/minute total (ASR + LLM + TTS + orchestration + telephony)
- Self-hosted at scale: $0.05-0.10/minute (GPU compute $0.03-0.05/min amortized)
- S2S models (GPT-4o voice): $0.30-1.50/minute for longer conversations; cascading ~$0.15/min regardless of length

For operations teams:
- Streaming architecture mandatory: begin TTS as LLM tokens arrive, process partial transcriptions before speech ends
- Model routing saves cost: simple queries to Gemini Flash (300ms), complex to GPT-4o (700ms)
- Alerting: P95 > 800ms warning, P95 > 1200ms critical, component TTFB > 2x baseline investigate
SEO Elements
Squarespace Excerpt (158 characters)
Voice AI infrastructure requires sub-500ms latency for natural conversation. Complete guide to ASR, LLM, TTS stack and orchestration platforms like Vapi and LiveKit.
SEO Title (54 characters)
Voice AI Infrastructure: Real-Time Speech Agent Guide
SEO Description (155 characters)
Build voice AI agents with sub-500ms latency. Compare ASR providers (Deepgram, AssemblyAI), TTS options (ElevenLabs, Cartesia), and orchestration platforms.
Title Review
Current title "Voice AI Infrastructure: Building Real-Time Speech Agents" works at 54 characters. Alternatives:
- "Real-Time Voice AI: ASR to TTS Infrastructure Guide 2025" (53 chars)
- "Voice Agent Infrastructure: Sub-500ms Latency Guide 2025" (53 chars)
URL Slug Recommendations
Primary: voice-ai-infrastructure-real-time-speech-agents-asr-tts-guide-2025
Alternative 1: voice-ai-stack-deepgram-elevenlabs-vapi-livekit-guide
Alternative 2: real-time-voice-agents-asr-llm-tts-infrastructure-2025
Alternative 3: voice-ai-latency-optimization-sub-500ms-guide-2025