Voice AI Infrastructure: Building Real-Time Speech Agents
Updated December 11, 2025
December 2025 Update: Deepgram STT delivers transcription in 150ms and ElevenLabs TTS speaks in 75ms, yet most agents still take 800ms-2s to respond because latency compounds across the stack. Human conversation expects a 300-500ms response window. Pipeline latency: STT (100-500ms) + LLM (350ms-1s+) + TTS (75-200ms). Every millisecond matters for production voice agents.
Deepgram delivers speech-to-text in 150 milliseconds. ElevenLabs synthesizes voice in 75 milliseconds. Yet most voice AI agents still take 800 milliseconds to two seconds to respond—because latency compounds across the entire stack.¹ The gap between component capabilities and end-to-end performance reveals the infrastructure challenge at the heart of voice AI: orchestrating speech recognition, language models, and synthesis into pipelines that match human conversational timing.
Human conversation operates within a 300-500 millisecond response window.² Delays beyond 500 milliseconds feel unnatural. Beyond 1.2 seconds, users hang up or interrupt. Building voice agents that meet these thresholds requires understanding each layer of the stack, selecting appropriate components, and architecting systems where every millisecond counts.
The voice AI stack
Every voice agent relies on four components working in concert:³
Speech-to-Text (STT/ASR): The "ears" that transcribe spoken audio into text. Latency ranges from 100-500 milliseconds depending on streaming configuration.
Large Language Model (LLM): The "brain" that processes transcribed text and generates responses. Latency ranges from 350 milliseconds for optimized models to over one second for frontier models.
Text-to-Speech (TTS): The "voice" that synthesizes response text into audio. Modern streaming TTS achieves 75-200 millisecond time-to-first-audio.
Orchestration: The "conductor" managing real-time flow between components, handling turn-taking, interruptions, and session state.
The latency equation
Voice AI latency accumulates across the pipeline:⁴
Total Latency = STT + LLM + TTS + Network + Processing
= 200ms + 500ms + 150ms + 50ms + 100ms
= 1000ms (typical)
Achieving sub-500 millisecond responses requires either compressing each component or parallelizing the pipeline through streaming—starting speech synthesis before the LLM finishes generating, processing partial transcriptions before users finish speaking.
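As a back-of-envelope illustration of why streaming matters, the sketch below contrasts the sequential figures from the equation above with an overlapped pipeline in which the LLM starts on partial transcripts and TTS begins as soon as the first usable chunk of the response exists. The overlapped numbers are illustrative assumptions, not measurements.

```python
# Illustrative latency math: sequential vs. streamed pipeline (example values only).
stt, llm, tts, network, processing = 200, 500, 150, 50, 100  # milliseconds

sequential = stt + llm + tts + network + processing
print(f"Sequential pipeline: {sequential} ms")  # 1000 ms

# With streaming, perceived latency is roughly time-to-first-audio: the final
# transcript lands shortly after the user stops speaking, the LLM only needs to
# emit its first sentence, and TTS starts on that sentence immediately.
stt_final = 150           # endpoint + final transcript after end of speech (assumed)
llm_first_sentence = 300  # time to first usable chunk of the response (assumed)
tts_first_audio = 100     # time-to-first-audio for that chunk (assumed)
streamed = stt_final + llm_first_sentence + tts_first_audio + network
print(f"Streamed pipeline (time to first audio): {streamed} ms")  # ~600 ms
```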
Speech-to-text infrastructure
The ASR layer converts audio streams into text that language models can process. Provider selection involves balancing latency, accuracy, and cost.
Provider comparison
Deepgram Nova-3:⁵
- Time-to-first-token: ~150ms (US), 250-350ms (global)
- Word error rate: 18.3%
- Streaming optimized with real-time factor 0.2-0.3x
- Pricing: $0.0043/minute (pay-as-you-go)
- Best for: Low-latency voice agents prioritizing speed

AssemblyAI Universal-2:⁶
- Latency: 300-600ms
- Word error rate: 14.5% (best accuracy among streaming models)
- Strong domain-specific performance in medical and sales contexts
- Pricing: $0.00025/second
- Best for: Applications requiring accuracy over raw speed

Whisper (self-hosted):⁷
- Latency: 1-5 seconds (batch), 380-520ms (WhisperX optimized)
- Highest accuracy for offline transcription
- Requires significant engineering for production streaming
- Best for: Batch processing, hybrid architectures

Groq-accelerated Whisper:
- Latency: Sub-300ms on LPU hardware
- Combines Whisper accuracy with streaming latency
- Limited availability through GroqCloud
- Best for: Quality-focused real-time applications
ASR infrastructure patterns
Streaming architecture: Begin transcription immediately as audio arrives rather than waiting for complete utterances. Partial results feed downstream components before users finish speaking.
```python
# Streaming ASR pattern
async def transcribe_stream(audio_stream):
    async for chunk in audio_stream:
        partial = await asr_client.transcribe_chunk(chunk)
        if partial.is_final:
            yield partial.text
        else:
            # Send interim results for prediction
            yield partial.interim_text
```
Voice Activity Detection (VAD): Detect when users start and stop speaking. Poor VAD creates either premature cutoffs (interrupting users) or excessive delays (waiting for silence that already occurred).
Endpointing: Determine when a user has finished their turn. Aggressive endpointing reduces latency but risks cutting off speakers. Conservative endpointing ensures completeness but adds delay.
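A minimal sketch of silence-based endpointing, assuming 20ms audio frames and a `vad_is_speech()` helper (for example, a wrapper around an energy threshold or a VAD model); both names are placeholders. The aggressive-versus-conservative trade-off described above lives entirely in the `silence_ms` threshold.

```python
# Minimal silence-based endpointing sketch (hypothetical vad_is_speech helper).
# A lower silence_ms cuts latency but risks clipping the speaker;
# a higher value waits longer for certainty.
FRAME_MS = 20

def detect_end_of_turn(frames, vad_is_speech, silence_ms=400):
    """Return True once `silence_ms` of continuous silence follows speech."""
    silence_run = 0
    heard_speech = False
    for frame in frames:
        if vad_is_speech(frame):
            heard_speech = True
            silence_run = 0
        elif heard_speech:
            silence_run += FRAME_MS
            if silence_run >= silence_ms:
                return True  # user has likely finished their turn
    return False
```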
GPU requirements for self-hosted ASR
Self-hosted Whisper deployments require GPU acceleration:⁸
| Workload Level | GPU | Concurrent Streams |
|---|---|---|
| Development | RTX 3060/4060 | 5-10 |
| Production | A100 40GB | 50-100 |
| Enterprise | H100 | 200+ |
Production speech-to-text typically runs on A100 or RTX 6000 Ada rather than H100—the workload benefits more from memory bandwidth than raw compute.
Large language model layer
The LLM processes transcribed speech and generates response text. Model selection dramatically affects both latency and conversation quality.
Model latency profiles
Ultra-fast (sub-350ms):⁹
- Gemini 1.5 Flash: ~300ms time-to-first-token
- Groq-served Llama: ~200ms on LPU
- Best for: Maximum responsiveness, simpler queries

Fast (350-700ms):
- GPT-4o-mini: ~400ms
- Claude 3.5 Haiku: ~350ms
- Best for: Balanced speed and capability

Standard (700ms-1s+):
- GPT-4o: ~700ms
- Claude 3.5 Sonnet: ~800ms
- Best for: Complex reasoning, quality-critical applications
Optimization strategies
Streaming generation: Begin TTS synthesis as LLM tokens arrive rather than waiting for complete responses. Modern orchestration pipelines stream tokens directly to speech synthesis.
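A sketch of that pattern, assuming an async token iterator from the LLM and a `tts.synthesize_stream()` call that accepts text fragments; the real interfaces vary by provider, so treat both names as placeholders.

```python
import re

# Stream LLM tokens to TTS at clause boundaries instead of waiting for the
# full response. `llm_tokens` and `tts` are placeholder interfaces.
PAUSE_POINT = re.compile(r"[.!?,;:]\s*$")

async def speak_as_generated(llm_tokens, tts):
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        # Flush whenever we hit a natural pause point and have enough text.
        if len(buffer) > 20 and PAUSE_POINT.search(buffer):
            async for audio_chunk in tts.synthesize_stream(buffer):
                yield audio_chunk
            buffer = ""
    if buffer:  # flush whatever remains at the end of the response
        async for audio_chunk in tts.synthesize_stream(buffer):
            yield audio_chunk
```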
Speculative execution: Predict likely responses based on partial transcriptions. Start generating responses before users finish speaking, discarding predictions that don't match final intent.
Model routing: Route simple queries to fast models, complex queries to capable models. A classifier determines query complexity in single-digit milliseconds.
```python
# Model routing pattern
def route_query(transcript, context):
    complexity = classify_complexity(transcript)
    if complexity == "simple":
        return "gemini-flash"
    elif complexity == "moderate":
        return "gpt-4o-mini"
    else:
        return "gpt-4o"
```
Prompt optimization: Shorter prompts reduce processing time. Cache system prompts where providers support prompt caching (Anthropic achieves 90% cost reduction on cached prefixes).
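As one example, Anthropic's prompt caching marks a content block as cacheable so repeated turns reuse the processed prefix. The snippet below is a sketch based on the documented `cache_control` parameter; the prompt text and transcript are placeholders, and exact fields should be checked against current API docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "You are a voice agent for Acme Support. ..."  # static, reused every turn (placeholder)
transcript = "I'd like to check my order status."                   # latest user turn (placeholder)

# The static system prompt is marked cacheable so repeated turns reuse the
# processed prefix instead of paying to process it on every request.
response = client.messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": transcript}],
)
print(response.content[0].text)
```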
Text-to-speech infrastructure
TTS converts LLM-generated text into natural-sounding speech. The layer has transformed from a bottleneck (2-3 seconds historically) to a strength (75-150ms with modern providers).
Provider comparison
ElevenLabs Flash v2.5:¹⁰
- Time-to-first-audio: 75ms
- Voice quality: Industry-leading naturalness
- Emotional range: Excellent expressiveness
- Pricing: $0.050/1,000 characters
- Best for: Quality-critical applications

Cartesia Sonic:¹¹
- Time-to-first-audio: 40-95ms
- Purpose-built for real-time conversation
- Consistent low latency under load
- Pricing: $0.038/1,000 characters
- Best for: Latency-critical applications

Deepgram Aura-2:¹²
- Time-to-first-audio: Sub-150ms
- Enterprise-grade reliability
- Cost-effective at scale
- Pricing: $0.030/1,000 characters
- Best for: High-volume enterprise deployments

PlayHT:
- Latency: ~300ms
- Extensive voice library
- Voice cloning capabilities
- Lower price point
- Best for: Budget-conscious applications
TTS infrastructure patterns
Streaming synthesis: Generate audio progressively as text arrives from the LLM. Send audio chunks to users before complete sentences finish synthesizing.
Audio buffering: Maintain small buffers to smooth playback despite variable synthesis timing. Buffer too much and latency suffers. Buffer too little and audio stutters.
Voice caching: Cache frequently used phrases (greetings, common responses) as pre-synthesized audio. Eliminates TTS latency entirely for cached content.
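A minimal sketch of phrase-level audio caching, assuming a synchronous `tts_synthesize()` helper (a placeholder for the provider call): fixed phrases are synthesized once, then replayed from memory with zero synthesis latency.

```python
import hashlib

# Pre-synthesized audio cache for fixed phrases (greetings, disclaimers).
# `tts_synthesize` is a placeholder for the provider call returning audio bytes.
_audio_cache: dict[str, bytes] = {}

def get_audio(text: str, voice: str, tts_synthesize) -> bytes:
    key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
    if key not in _audio_cache:                  # cache miss: pay TTS latency once
        _audio_cache[key] = tts_synthesize(text=text, voice=voice)
    return _audio_cache[key]                     # cache hit: no synthesis latency

def warm_cache(phrases: list[str], voice: str, tts_synthesize) -> None:
    """Synthesize common phrases at startup so no caller ever waits for them."""
    for phrase in phrases:
        get_audio(phrase, voice, tts_synthesize)
```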
Orchestration platforms
Orchestration layers connect ASR, LLM, and TTS components while handling telephony, turn-taking, and session management. Platform selection determines development velocity and production reliability.
Platform comparison
Vapi:¹³
- Focus: Turnkey voice agent platform
- Telephony: Native SIP/PSTN integration
- Customization: Modular component selection
- Pricing: $0.05/minute + component costs
- Best for: Rapid deployment, telephone-focused applications

LiveKit:¹⁴
- Focus: Open-source real-time infrastructure
- Architecture: WebRTC-native with agent framework
- Customization: Full control, self-hostable
- Pricing: Free tier (100 concurrent, 5,000 minutes/month), paid from $50/month
- Best for: Custom applications, teams needing full control

Retell AI:¹⁵
- Focus: Natural conversation flow
- Differentiator: Optimized turn-taking and interruption handling
- Compliance: HIPAA and SOC 2 Type II
- Pricing: $0.07+/minute
- Best for: Conversation quality priority, enterprise compliance

Pipecat:
- Focus: Open-source agent framework
- Integration: Works with major cloud providers
- Customization: Highly flexible pipeline construction
- Best for: Developers wanting framework without platform lock-in
Selection criteria
| Factor | Vapi | LiveKit | Retell |
|---|---|---|---|
| Telephony integration | Excellent | Good (via SIP) | Excellent |
| Customization | High | Highest | Moderate |
| Setup complexity | Low | Moderate | Low |
| Self-hosting | No | Yes | No |
| Enterprise features | Good | Growing | Excellent |
Architecture patterns
Cascading pipeline (ASR → LLM → TTS)
The traditional architecture processes audio through discrete stages:¹⁶
Audio → ASR → Text → LLM → Response Text → TTS → Audio
Advantages:
- Component modularity (swap providers easily)
- Mature tooling and debugging
- Predictable cost structure (~$0.15/minute regardless of conversation length)
- Transparent intermediate representations (text is inspectable)

Challenges:
- Latency accumulation across stages
- Information loss in text representation (prosody, emotion)
- Complex streaming coordination
Speech-to-speech (S2S)
End-to-end models process audio directly to audio:¹⁷
Audio → Multimodal Model → Audio
Examples:
- GPT-4o voice mode
- Moshi (Kyutai Labs)
- Ultravox

Advantages:
- Preserves prosodic information
- Potentially lower latency (single model)
- Handles overlapping speech naturally

Challenges:
- Higher cost (~$0.30-1.50/minute for longer conversations)
- Limited customization (cannot swap components)
- Debugging opacity (no intermediate text)
Hybrid approaches
Production systems increasingly combine architectures:
Cascading with S2S fallback: Use cascading for standard interactions, switch to S2S for complex overlapping dialogue.
Parallel processing: Run ASR and intent prediction simultaneously. Begin response generation based on predicted intent while ASR completes.
Speculative TTS: Pre-generate likely response audio. Play cached audio immediately if prediction matches; fall back to synthesis otherwise.
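A sketch of the speculative-TTS idea under simple assumptions: while the user is still speaking, a cheap predictor guesses the likely reply from the partial transcript and pre-synthesizes it in the background; the audio is used only if the final response matches, otherwise the system falls back to normal synthesis. All helpers are placeholders.

```python
import asyncio

# Speculative TTS sketch: pre-generate audio for a predicted reply and use it
# only if the final response matches. All helper callables are placeholders.
async def respond_speculatively(partial_transcript, predict_reply,
                                generate_reply, synthesize):
    predicted = predict_reply(partial_transcript)             # cheap guess
    speculative = asyncio.create_task(synthesize(predicted))  # pre-generate in background

    final_reply = await generate_reply(partial_transcript)    # authoritative answer
    if final_reply.strip() == predicted.strip():
        return await speculative      # reuse pre-generated audio, no added TTS wait
    speculative.cancel()              # prediction missed; discard it
    return await synthesize(final_reply)
```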
Scaling voice AI infrastructure
Concurrent capacity planning
Voice AI scales differently than text-based AI. Each concurrent call requires dedicated processing resources across the pipeline.¹⁸
Per-GPU capacity (self-hosted):
| GPU | ASR Streams | LLM Concurrent | TTS Streams |
|---|---|---|---|
| L4 | 50 | 20-30 | 100 |
| L40S | 100 | 50-75 | 200 |
| A100 | 100 | 75-100 | 250 |
| H100 | 200+ | 150-200 | 400+ |
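For self-hosted deployments, a rough sizing pass follows directly from the table above. The sketch below assumes the A100 column and leaves headroom so no pool runs at its nominal limit; real capacity depends on models, batch sizes, and audio characteristics.

```python
import math

# Rough GPU sizing from the per-GPU stream counts above (A100 column).
STREAMS_PER_GPU = {"asr": 100, "llm": 75, "tts": 250}

def gpus_needed(concurrent_calls: int, headroom: float = 0.7) -> dict[str, int]:
    """Size each pool so it runs at ~70% of its nominal stream capacity."""
    return {
        component: math.ceil(concurrent_calls / (capacity * headroom))
        for component, capacity in STREAMS_PER_GPU.items()
    }

print(gpus_needed(500))  # {'asr': 8, 'llm': 10, 'tts': 3}
```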
Managed service capacity: Cloud providers handle scaling automatically but enforce rate limits. Enterprise agreements unlock dedicated capacity for predictable high-volume deployments.
Geographic distribution
Voice latency includes network round-trip time. Global deployments require regional infrastructure:¹⁹
Colocation strategy: Place inference infrastructure at telephony points of presence. Telnyx colocates GPUs with telephony PoPs for ultra-low latency.
Edge deployment: Run ASR and TTS at edge locations while centralizing LLM inference. Reduces audio transport latency while maintaining model capability.
Regional failover: Maintain capacity in multiple regions for resilience. Route calls to nearest healthy region.
Microservices architecture
Production voice AI benefits from service decomposition:²⁰
```
                ┌─────────────┐
                │   Gateway   │
                └──────┬──────┘
       ┌───────────────┼───────────────┐
       │               │               │
  ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
  │   ASR   │     │   LLM   │     │   TTS   │
  │ Service │     │ Service │     │ Service │
  └────┬────┘     └────┬────┘     └────┬────┘
       │               │               │
  ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
  │   ASR   │     │   LLM   │     │   TTS   │
  │  Pods   │     │  Pods   │     │  Pods   │
  └─────────┘     └─────────┘     └─────────┘
```
Benefits:
- Independent scaling per component
- Fault isolation (ASR failure doesn't crash TTS)
- Technology flexibility (different languages per service)
- Resource optimization (right-size each service)
Cost modeling
Per-minute economics
Voice AI costs accumulate across the stack:²¹
Managed platform (typical):
ASR: $0.006/minute (Deepgram)
LLM: $0.02-0.10/minute (varies by model)
TTS: $0.02/minute (mid-tier)
Orchestration: $0.05/minute (Vapi)
Telephony: $0.01/minute
Total: $0.10-0.20/minute
Self-hosted (at scale):
GPU compute: $0.03-0.05/minute (amortized)
Infrastructure: $0.01/minute
Engineering overhead: Variable
Total: $0.05-0.10/minute (higher volume)
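A quick sketch, using the per-minute figures above, for comparing managed and self-hosted spend at a given call volume; the rates are mid-range values from this section and should be replaced with negotiated pricing.

```python
# Monthly cost comparison using the per-minute figures above (illustrative).
MANAGED_PER_MIN = 0.006 + 0.06 + 0.02 + 0.05 + 0.01  # ASR + LLM (mid-range) + TTS + orchestration + telephony
SELF_HOSTED_PER_MIN = 0.04 + 0.01                     # amortized GPU compute + infrastructure

def monthly_cost(calls_per_day: int, avg_minutes: float = 4.0) -> dict[str, float]:
    minutes = calls_per_day * avg_minutes * 30
    return {
        "minutes": minutes,
        "managed_usd": round(minutes * MANAGED_PER_MIN, 2),
        "self_hosted_usd": round(minutes * SELF_HOSTED_PER_MIN, 2),
    }

print(monthly_cost(2_000))  # ~240,000 minutes/month
```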
Cost optimization strategies
Model tiering: Route simple queries to cheaper models. Save expensive frontier models for complex conversations.
Caching: Cache common responses as audio. Eliminate ASR/LLM/TTS costs for cached interactions entirely.
Off-peak scheduling: Batch non-real-time voice processing during lower-cost compute windows.
Provider negotiation: High-volume deployments unlock significant discounts. Negotiate committed use agreements.
Organizations deploying voice AI infrastructure at scale can leverage Introl's global expertise for GPU deployment and optimization across 257 locations worldwide.
Production considerations
Latency monitoring
Track latency at each pipeline stage:
Key metrics:
- Time-to-first-byte (TTFB) per component
- End-to-end response latency (user speech end → agent speech start)
- P50, P95, P99 latency distributions
- Latency by geographic region

Alerting thresholds:
- P95 > 800ms: Warning
- P95 > 1200ms: Critical
- Component TTFB > 2x baseline: Investigate
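A small sketch of turning per-turn latency samples into the percentiles and alert states listed above; in production these numbers would come from a metrics system, but the thresholds are the ones stated here.

```python
import statistics

WARNING_P95_MS, CRITICAL_P95_MS = 800, 1200

def latency_report(samples_ms: list[float]) -> dict:
    """Compute P50/P95/P99 and map P95 onto the alerting thresholds above."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    if p95 > CRITICAL_P95_MS:
        status = "critical"
    elif p95 > WARNING_P95_MS:
        status = "warning"
    else:
        status = "ok"
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99, "status": status}
```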
Reliability patterns
Circuit breakers: Fail fast when downstream services degrade. Route to backup providers automatically.
Graceful degradation: Fall back to simpler models or cached responses when primary services fail.
Health checks: Monitor each component independently. Remove unhealthy instances from rotation.
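To make the circuit-breaker pattern above concrete, here is a minimal sketch wrapping a provider call: after a few consecutive failures the primary is skipped for a cooldown period and traffic goes to the backup. The callables are placeholders for whichever STT, LLM, or TTS client is being protected.

```python
import time

# Minimal circuit breaker for a provider call. After `max_failures` consecutive
# errors the primary is skipped for `cooldown_s` and the backup is used instead.
class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def call(self, primary, backup, *args, **kwargs):
        circuit_open = (
            self.failures >= self.max_failures
            and time.monotonic() - self.opened_at < self.cooldown_s
        )
        if not circuit_open:
            try:
                result = primary(*args, **kwargs)
                self.failures = 0          # success closes the circuit
                return result
            except Exception:
                self.failures += 1
                self.opened_at = time.monotonic()
        return backup(*args, **kwargs)     # degrade to the backup provider
```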
Compliance requirements
HIPAA (healthcare): Ensure BAAs cover all pipeline components. Voice data contains PHI.
PCI-DSS (payments): Voice recordings may contain card numbers. Implement appropriate redaction.
GDPR/CCPA: Voice biometrics constitute personal data. Implement consent and deletion workflows.
The sub-500ms imperative
Voice AI infrastructure succeeds or fails on latency. Users abandon conversations that feel sluggish. They engage with agents that respond naturally. The 500-millisecond threshold separating natural from artificial interaction drives every architectural decision.
Achieving production-grade voice AI requires optimizing each pipeline component while architecting systems that parallelize processing wherever possible. Streaming ASR feeds partial transcriptions to speculative LLM processing. Streaming TTS begins speaking before responses complete. The entire system operates as a continuous flow rather than discrete stages.
The infrastructure investment pays returns in user experience. Voice agents that respond in 400 milliseconds feel like conversations. Those responding in 1.5 seconds feel like telephone menus. For organizations building voice AI products, the latency optimization effort determines whether users adopt or abandon the technology.
The stack has matured. Deepgram delivers 150ms ASR. ElevenLabs achieves 75ms TTS. Groq serves LLMs at 200ms. The components exist to build sub-500ms voice agents. Success depends on infrastructure that orchestrates these components into seamless conversational experiences.
References
1. Retell AI. "AI Voice Agent Latency Face-Off 2025." 2025. https://www.retellai.com/resources/ai-voice-agent-latency-face-off-2025
2. TringTring.AI. "Understanding Latency in AI Voice Agents: Why Sub-500ms Matters." 2025. https://tringtring.ai/blog/technical-deep-dive/understanding-latency-in-ai-voice-agents-why-sub-500ms-matters/
3. AssemblyAI. "The voice AI stack for building agents in 2025." 2025. https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents
4. Twilio. "Core Latency in AI Voice Agents." 2025. https://www.twilio.com/en-us/blog/developers/best-practices/guide-core-latency-ai-voice-agents
5. Deepgram. "Speech-to-Text API." 2025. https://deepgram.com/product/speech-to-text
6. AssemblyAI. "Top APIs and models for real-time speech recognition and transcription in 2025." 2025. https://www.assemblyai.com/blog/best-api-models-for-real-time-speech-recognition-and-transcription
7. Ionio. "2025 Edge Speech-to-Text Model Benchmark: Whisper vs. Competitors." 2025. https://www.ionio.ai/blog/2025-edge-speech-to-text-model-benchmark-whisper-vs-competitors
8. Dell Technologies. "Benchmarking NVIDIA GPU Throughput for LLMs." 2025. https://infohub.delltechnologies.com/en-us/p/benchmarking-nvidia-gpu-throughput-for-llms-and-understanding-gpu-configuration-choices-in-the-ai-space/
9. ElevenLabs. "How do you optimize latency for Conversational AI?" 2025. https://elevenlabs.io/blog/how-do-you-optimize-latency-for-conversational-ai
10. ElevenLabs. "ElevenLabs vs. Cartesia." June 2025. https://elevenlabs.io/blog/elevenlabs-vs-cartesia
11. Podcastle. "Streaming TTS benchmark: Async vs Elevenlabs vs Cartesia." 2025. https://podcastle.ai/blog/tts-latency-vs-quality-benchmark/
12. Deepgram. "Introducing Aura-2: Enterprise-Grade Text-to-Speech." 2025. https://deepgram.com/learn/introducing-aura-2-enterprise-text-to-speech
13. Softcery. "11 Voice Agent Platforms Compared." 2025. https://softcery.com/lab/choosing-the-right-voice-agent-platform-in-2025
14. Modal. "LiveKit vs Vapi: which voice AI framework is best in 2025?" 2025. https://modal.com/blog/livekit-vs-vapi-article
15. Retell AI. "Top 11 Vapi AI Alternatives for Conversational Voice Agents." 2025. https://www.retellai.com/blog/best-vapi-alternatives-for-enterprise-voice-ai
16. Brain.co. "Chained Voice Agent Architectures: Speech-to-Speech vs Chained Pipeline vs Hybrid Approaches." 2025. https://brain.co/blog/chained-voice-agent-architectures-speech-to-speech-vs-chained-pipeline-vs-hybrid-approaches
17. Softcery. "Real-Time vs Turn-Based Voice Agent Architecture." 2025. https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture
18. ZenML. "ElevenLabs: Scaling Voice AI with GPU-Accelerated Infrastructure." 2025. https://www.zenml.io/llmops-database/scaling-voice-ai-with-gpu-accelerated-infrastructure
19. Telnyx. "Voice AI Latency: Why colocation is mission critical." 2025. https://telnyx.com/resources/how-telnyx-fixed-voice-ai-latency-with-co-located-infrastructure
20. Redapt. "Empowering Scalable Voice AI with Optimized Infrastructure." 2025. https://www.redapt.com/customer-stories/empowering-scalable-voice-ai-with-optimized-infrastructure
21. Cerebrium. "Deploying a global scale, AI voice agent with 500ms latency." 2025. https://www.cerebrium.ai/blog/deploying-a-global-scale-ai-voice-agent-with-500ms-latency
Key takeaways
For infrastructure architects:
- Latency equation: STT (200ms) + LLM (500ms) + TTS (150ms) + network (50ms) + processing (100ms) = 1,000ms typical
- Sub-500ms threshold for natural conversation; users abandon at 1.2+ seconds; 300-500ms matches human response timing
- GPU capacity per device: L4 (50 ASR / 20 LLM / 100 TTS streams), A100 (100/75/250), H100 (200+/150/400+)

For provider selection:
- ASR: Deepgram Nova-3 150ms TTFT at $0.0043/min; AssemblyAI 300-600ms but 14.5% WER (best accuracy)
- TTS: ElevenLabs Flash 75ms at $0.050/1K chars (best quality); Cartesia 40-95ms at $0.038/1K (latency-critical)
- Orchestration: Vapi $0.05/min (fastest setup), LiveKit (open source, self-hostable), Retell (HIPAA/SOC 2)

For finance teams:
- Managed platform: $0.10-0.20/minute total (ASR + LLM + TTS + orchestration + telephony)
- Self-hosted at scale: $0.05-0.10/minute (GPU compute $0.03-0.05/min amortized)
- S2S models (GPT-4o voice): $0.30-1.50/minute for longer conversations; cascading ~$0.15/min regardless of length

For operations teams:
- Streaming architecture mandatory: begin TTS as LLM tokens arrive, process partial transcriptions before speech ends
- Model routing saves cost: simple queries to Gemini Flash (300ms), complex to GPT-4o (700ms)
- Alerting: P95 > 800ms warning, P95 > 1200ms critical, component TTFB > 2x baseline investigate
SEO Elements
Squarespace Excerpt (158 characters)
Voice AI infrastructure requires sub-500ms latency for natural conversation. Complete guide to ASR, LLM, TTS stack and orchestration platforms like Vapi and LiveKit.
SEO Title (54 characters)
Voice AI Infrastructure: Real-Time Speech Agent Guide
SEO Description (155 characters)
Build voice AI agents with sub-500ms latency. Compare ASR providers (Deepgram, AssemblyAI), TTS options (ElevenLabs, Cartesia), and orchestration platforms.
Title Review
Current title "Voice AI Infrastructure: Building Real-Time Speech Agents" works at 54 characters. Alternatives:
- "Real-Time Voice AI: ASR to TTS Infrastructure Guide 2025" (53 chars)
- "Voice Agent Infrastructure: Sub-500ms Latency Guide 2025" (53 chars)
URL Slug Recommendations
Primary: voice-ai-infrastructure-real-time-speech-agents-asr-tts-guide-2025
Alternative 1: voice-ai-stack-deepgram-elevenlabs-vapi-livekit-guide
Alternative 2: real-time-voice-agents-asr-llm-tts-infrastructure-2025
Alternative 3: voice-ai-latency-optimization-sub-500ms-guide-2025