RAG Infrastructure: Building Production Retrieval-Augmented Generation Systems

Updated December 8, 2025

December 2025 Update: RAG adoption accelerating as enterprise LLM use case #1. GraphRAG and agentic RAG architectures gaining traction for complex reasoning. Vector database market consolidating around Pinecone, Weaviate, Milvus, and Qdrant. Voyage-3-large outperforming OpenAI and Cohere embeddings by 9-20%. Semantic chunking improving recall up to 9% over fixed-size approaches. Production challenges shifting from prototypes to scale—embedding drift, multi-tenancy, and sub-50ms latency requirements driving infrastructure investment.

Harvey AI serves 97% of the Am Law 100 law firms using retrieval-augmented generation to ground legal research in actual case law rather than hallucinated citations.¹ Anthropic, OpenAI, and Google all recommend RAG as the primary technique for connecting large language models to proprietary enterprise data. Yet the gap between a working RAG prototype and production-grade infrastructure spans months of engineering effort. Organizations discover that vector databases, embedding pipelines, chunking strategies, and retrieval optimization each present distinct infrastructure challenges that compound at scale. Building RAG systems that handle millions of documents, serve thousands of concurrent users, and maintain sub-second latency requires architectural decisions that few teams anticipate during proof-of-concept phases.

The core architecture every production RAG system requires

RAG systems combine two fundamental capabilities: retrieving relevant context from a knowledge base and generating responses grounded in that context. The architecture breaks into five distinct components, each with specific infrastructure requirements.

Document ingestion pipelines handle the flow from raw documents to searchable embeddings. Production systems process PDFs, HTML, Word documents, Slack messages, and database records through format-specific parsers. Ingestion pipelines must track document versions, handle incremental updates, and maintain metadata for filtering. Typical enterprise deployments process 100,000 to 10 million documents during initial backfill, with daily incremental loads of 1,000 to 50,000 new documents.²

Chunking systems divide documents into retrieval-friendly segments. Fixed-size chunking works for homogeneous content like news articles, while semantic chunking preserves meaning boundaries for complex documents.³ Most production systems use recursive chunking with 400-512 tokens and 10-20% overlap, achieving 85-90% recall in benchmark tests.⁴ Chunking strategy selection becomes semi-permanent—changing approaches later requires re-embedding the entire corpus.

Embedding infrastructure converts text chunks into dense vector representations. Organizations choose between managed APIs (OpenAI, Cohere, Voyage AI) and self-hosted models. Embedding generation creates the most variable cost structure in RAG systems, with pricing ranging from $0.02 to $0.18 per million tokens depending on model selection.⁵ Batch processing parallelizes embedding generation across GPU nodes for initial loads, while streaming pipelines handle incremental updates.

Vector databases store and retrieve embeddings using approximate nearest neighbor algorithms. The four dominant options—Pinecone, Weaviate, Milvus, and Qdrant—serve different operational profiles. Pinecone offers zero-ops managed service, Weaviate provides hybrid search with knowledge graph capabilities, Milvus handles billion-scale deployments, and Qdrant excels at complex metadata filtering.⁶ Storage requirements scale with embedding dimension and document count; a 10-million document corpus with 1024-dimensional embeddings requires approximately 40GB of vector storage.
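
That storage figure is just arithmetic; a quick back-of-the-envelope sketch (assuming float32 vectors and ignoring index overhead):

```python
# Back-of-the-envelope vector storage estimate (float32, index overhead ignored).
num_vectors = 10_000_000      # one embedding per chunk
dimensions = 1024             # e.g., a 1024-dimensional embedding model
bytes_per_float = 4           # float32

raw_bytes = num_vectors * dimensions * bytes_per_float
print(f"Raw vector storage: {raw_bytes / 1e9:.1f} GB")   # ~41 GB
# Index structures (e.g., HNSW graph links) and metadata add further overhead on top.
```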

Retrieval and generation orchestration ties the components together, typically using frameworks like LangChain, LlamaIndex, or custom implementations. Orchestration handles query processing, retrieval, reranking, prompt construction, and response generation. Production systems implement caching layers, fallback strategies, and observability instrumentation at each stage.

Vector database selection determines operational complexity

The vector database market consolidated around four major players by December 2025, each serving distinct operational profiles and use cases.

Pinecone dominates the managed-service segment, handling infrastructure entirely behind its API. Teams deploy production systems in hours rather than weeks, with automatic scaling, multi-region replication, and SOC 2 compliance included. Pinecone supports up to 40KB metadata per vector, enabling rich filtering without external systems. The tradeoff involves higher per-query costs and reduced control over infrastructure optimization. Organizations running predictable workloads often find Pinecone cost-effective; those with highly variable traffic or extreme scale requirements typically migrate to alternatives.⁷

Weaviate bridges open-source flexibility with managed convenience through Weaviate Cloud. The system combines vector search with knowledge graph capabilities, enabling hybrid queries that filter on structured data while ranking by semantic similarity. Weaviate's modular architecture supports multiple embedding models simultaneously, useful for organizations experimenting with different approaches. Docker and Kubernetes deployments require modest operational expertise, making Weaviate popular among teams with some infrastructure capability.⁸

Milvus (and its managed counterpart Zilliz Cloud) targets billion-scale deployments with performance as the primary design goal. Milvus leads benchmarks in raw latency, achieving sub-10ms query times on billion-vector indices through GPU acceleration and advanced indexing algorithms.⁹ The architecture separates compute and storage, enabling independent scaling of each layer. Operating Milvus requires significant data engineering expertise—teams without dedicated infrastructure personnel often struggle with cluster management and performance tuning.

Qdrant gained rapid adoption for complex filtering requirements. Built in Rust, Qdrant executes payload filtering directly within the search algorithm rather than as post-processing, delivering superior performance for filtered queries.¹⁰ The compact resource footprint makes Qdrant popular for cost-sensitive deployments, while its crisp API design accelerates development velocity. Self-hosted deployments run smoothly on modest infrastructure, though enterprise features require commercial licensing.

Selection criteria should prioritize operational capability first. Teams needing zero-ops choose Pinecone or Weaviate Cloud. Organizations with SRE capacity comfortable with stateful Kubernetes workloads gain cost savings and control from self-hosted Milvus, Qdrant, or Weaviate. Compliance requirements sometimes eliminate options—Pinecone and Weaviate Cloud offer SOC 2 and HIPAA compliance, while on-premise mandates require self-hosted solutions.

Embedding model selection affects both cost and retrieval quality

Embedding models convert text into vector representations, and model selection directly impacts retrieval accuracy. The December 2025 landscape offers three leading commercial options plus several strong open-source alternatives.

Voyage AI leads MTEB benchmarks, with voyage-3-large outperforming OpenAI text-embedding-3-large by 9.74% and Cohere embed-v3-english by 20.71% across evaluated domains.¹¹ Voyage AI supports 32K-token context windows (compared to 8K for OpenAI and 512 for older Cohere models), enabling processing of longer documents without chunking. The 1024-dimensional embeddings cost $0.06 per million tokens—2.2x cheaper than OpenAI and 1.6x cheaper than Cohere—while requiring 3x less vector storage than OpenAI's 3072-dimensional embeddings.

OpenAI text-embedding-3-large offers the most battle-tested option for production deployments. The model supports configurable output dimensions from 256 to 3072, enabling cost-storage tradeoffs. At $0.13 per million tokens, OpenAI sits in the middle of the pricing spectrum while providing reliable uptime and extensive documentation. Organizations already using OpenAI's inference APIs often standardize on their embeddings for operational simplicity.

Cohere embed-v4 achieved the highest MTEB score (65.2) as of November 2025, optimized specifically for search and retrieval rather than general-purpose embedding.¹² Cohere embeddings pair naturally with Cohere's reranker for two-stage retrieval pipelines. The model excels at multilingual applications, supporting over 100 languages with strong cross-lingual retrieval.

Open-source alternatives including BGE, E5, and GTE models enable self-hosted embedding at scale. Organizations processing billions of documents often deploy these models on internal GPU infrastructure to eliminate per-token costs. Self-hosting requires managing model updates, capacity planning, and inference optimization—tradeoffs that make sense only at significant scale.

The embedding model decision cascades through the entire system. Changing models later requires re-embedding the complete document corpus, a process that costs time, compute, and potentially service disruption. Production systems should evaluate models against domain-specific benchmarks rather than relying on generic MTEB scores. A model excelling at general knowledge may underperform on legal, medical, or financial text.

Chunking strategies determine retrieval precision

Document chunking creates the atomic units that the retrieval system searches. Chunking strategy selection ranks among the most consequential infrastructure decisions, with up to a 9% recall gap between the best and worst approaches.¹³

Fixed-size chunking splits documents at predetermined token counts regardless of content structure. The approach works well for homogeneous corpora—news articles, product descriptions, or standardized documents. Implementation requires minimal complexity, making fixed-size chunking the natural starting point for prototypes. Most production systems use 400-512 token chunks with 50-100 token overlaps, balancing retrieval granularity against context preservation.

Semantic chunking divides documents at meaningful boundaries—paragraph breaks, section headers, or thematic shifts—preserving coherent ideas within each chunk. Implementation uses sentence embeddings to detect semantic boundaries, splitting when similarity between adjacent sentences drops below a threshold. Semantic chunking improves recall by up to 9% for narrative content like documentation, FAQs, and conversational data.¹⁴ The approach requires more compute during ingestion and careful tuning of similarity thresholds.
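
As a minimal sketch of the boundary-detection idea, assuming the sentence-transformers library, an illustrative public model, and a similarity threshold that would need per-corpus tuning:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, splitting where adjacent similarity drops."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity between adjacent sentences (vectors are normalized).
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:            # semantic boundary detected
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```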

Recursive chunking applies hierarchical splitting rules, first attempting large splits (section breaks), then progressively smaller ones (paragraph breaks, sentence breaks) until chunks reach target size. LangChain's RecursiveCharacterTextSplitter implements this pattern, achieving strong performance across diverse document types without per-corpus tuning. Recursive chunking balances implementation simplicity against retrieval quality, making it the default recommendation for new systems.
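
A minimal recursive-chunking sketch with LangChain's splitter follows; note that chunk sizes are measured in characters unless a token-based length function is supplied, and the file path is a placeholder:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("docs/report.txt", encoding="utf-8").read()  # placeholder source

# Try paragraph, then sentence, then word boundaries before falling back to raw splits.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,       # roughly 400-500 tokens of English text (sizes in characters)
    chunk_overlap=300,     # ~15% overlap to preserve context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
```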

Page-level chunking emerged from NVIDIA benchmarks showing 0.648 accuracy with the lowest variance across document types.¹⁵ For structured documents like reports and papers, treating each page as a chunk preserves spatial relationships and cross-references. Page-level approaches work poorly for documents lacking clear page boundaries (HTML, chat logs, code) but excel for PDF-heavy corpora.

Hierarchical chunking builds multi-level indexes with nested granularity—section, subsection, paragraph, and sentence levels. Retrieval first identifies relevant sections, then drills into specific paragraphs within those sections. The approach enables both precise answers and broad context retrieval but increases storage requirements and query complexity.

Production systems should evaluate chunking strategies against representative queries before committing. Building small test indices with each strategy and measuring recall on known-good query sets reveals strategy-specific performance characteristics. Changing chunking approaches post-deployment requires complete re-ingestion and re-embedding.

Scaling challenges emerge at production volumes

The leap from prototype to production exposes infrastructure bottlenecks that rarely surface during development. Organizations deploying RAG at scale encounter predictable challenges requiring specific mitigation strategies.

Embedding throughput often constrains initial deployment timelines. A 10-million document corpus at 500 tokens per chunk requires embedding 5 billion tokens—approximately $650 using OpenAI or $300 using Voyage AI, plus compute time. Backfill operations should parallelize across multiple API keys or GPU nodes, processing documents in batches of 100-1000. Incremental updates use streaming pipelines, embedding new and modified documents as they arrive.¹⁶
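
The backfill figures above fall out of simple arithmetic, using the per-million-token prices cited earlier:

```python
# Backfill embedding cost estimate (prices per million tokens, as cited above).
documents = 10_000_000
tokens_per_chunk = 500
total_tokens = documents * tokens_per_chunk            # 5 billion tokens

prices = {"voyage-3-large": 0.06, "openai text-embedding-3-large": 0.13}
for model, price_per_million in prices.items():
    cost = total_tokens / 1_000_000 * price_per_million
    print(f"{model}: ${cost:,.0f}")                    # $300 and $650
```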

Vector database scaling follows different patterns than traditional databases. Index saturation becomes problematic as in-memory indices exceed available RAM, forcing disk access and latency degradation. Sharding strategies partition vectors by topic, time window, or hash value to distribute load across nodes. Milvus and Qdrant support horizontal scaling through native clustering; Pinecone handles scaling automatically behind its managed service.

Query latency requirements typically target sub-100ms for interactive applications. Real-time user experiences demand sub-50ms latency, requiring careful optimization of the entire retrieval path. Smart caching using Redis or Memcached eliminates redundant vector searches for repeated or similar queries. Pre-computing embeddings for common query patterns further reduces latency. Geographic distribution through multi-region deployments minimizes network round-trip times.¹⁷
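
A minimal sketch of that caching pattern, assuming redis-py and a hypothetical embed_query() helper; hashing the normalized query lets repeated questions skip the embedding API entirely:

```python
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)   # assumed local Redis instance

def cached_query_embedding(query: str, embed_query) -> list[float]:
    """Return a cached embedding for the query, computing and storing it on a miss."""
    key = "qemb:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    vector = embed_query(query)                 # hypothetical embedding call
    r.set(key, json.dumps(vector), ex=3600)     # 1-hour TTL; tune per traffic pattern
    return vector
```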

Embedding drift degrades retrieval quality over time when embedding models change. Organizations updating embedding models must re-embed their entire corpus to maintain consistency—older embeddings become incompatible with new query embeddings. Production systems track embedding model versions per document, enabling gradual migration and rollback if needed.
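
One way to record that provenance, sketched with a hypothetical metadata schema rather than any particular database's API:

```python
from datetime import datetime, timezone

EMBEDDING_MODEL = "voyage-3-large"   # whichever model is currently in production

def build_vector_record(doc_id: str, chunk_text: str, vector: list[float]) -> dict:
    """Attach embedding provenance so stale vectors can be found and re-embedded."""
    return {
        "id": doc_id,
        "values": vector,
        "metadata": {
            "text": chunk_text,
            "embedding_model": EMBEDDING_MODEL,
            "embedded_at": datetime.now(timezone.utc).isoformat(),
        },
    }

# During migration, select records whose metadata.embedding_model differs from the
# current model, re-embed those documents in batches, then switch query embeddings.
```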

Multi-tenancy introduces isolation requirements for SaaS deployments. Tenants expect their documents to remain invisible to other users, requiring filtering at both ingestion and query time. Metadata-based filtering adds tenant IDs to every vector, but large tenants can overwhelm shared indices. Separate indices per tenant provide stronger isolation at the cost of management complexity and reduced resource sharing efficiency.
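
A hedged sketch of metadata-based isolation, using Pinecone-style filter syntax as an example (other databases expose equivalent constructs); the client and helper names are placeholders:

```python
def upsert_for_tenant(index, doc_id: str, vector: list[float], tenant_id: str, text: str):
    """Stamp the tenant ID at ingestion so query-time filtering has something to match."""
    index.upsert(vectors=[(doc_id, vector, {"tenant_id": tenant_id, "text": text})])

def query_for_tenant(index, query_vector: list[float], tenant_id: str, top_k: int = 10):
    """Always scope searches to the calling tenant; never trust client-side filters."""
    return index.query(
        vector=query_vector,
        top_k=top_k,
        filter={"tenant_id": {"$eq": tenant_id}},   # Pinecone-style metadata filter
        include_metadata=True,
    )
```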

Cost optimization becomes critical at scale. Embedding costs grow linearly with document volume, while vector storage costs grow with document count and embedding dimension. Quantization reduces storage requirements by 4-8x with minimal accuracy loss—8-bit quantized embeddings often outperform full-precision embeddings from weaker models.¹⁸ Tiered storage strategies keep hot documents in high-performance indices while archiving cold content to cheaper storage.
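
The 4x figure for 8-bit scalar quantization is visible in a small numpy sketch; production systems would normally rely on the vector database's built-in quantization rather than rolling their own:

```python
import numpy as np

vectors = np.random.rand(100_000, 1024).astype(np.float32)   # stand-in corpus

# Symmetric scalar quantization: map each dimension's range onto int8.
scale = np.abs(vectors).max(axis=0) / 127.0
quantized = np.round(vectors / scale).astype(np.int8)

print(f"float32: {vectors.nbytes / 1e6:.0f} MB")     # ~410 MB
print(f"int8:    {quantized.nbytes / 1e6:.0f} MB")   # ~102 MB (4x smaller)

# Approximate reconstruction for distance computation (dequantize on the fly).
restored = quantized.astype(np.float32) * scale
```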

Advanced architectures extend basic RAG capabilities

Production systems increasingly adopt architectural patterns that enhance retrieval quality and enable new capabilities beyond simple vector search.

Hybrid search combines dense vector retrieval with sparse keyword matching. Vector search excels at semantic similarity, finding relevant content even when queries use different terminology than documents. Keyword search provides precision for specific terms, acronyms, and proper nouns that embedding models may not capture. Weaviate and Vespa support hybrid search natively, blending BM25 scores with vector similarity. Most production systems weight hybrid results 70-80% toward vectors, 20-30% toward keywords.¹⁹
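
A hedged sketch of the weighted-fusion idea, assuming both score sets are already normalized to [0, 1]; many systems instead use the database's native hybrid mode or reciprocal rank fusion:

```python
def hybrid_scores(vector_hits: dict[str, float],
                  keyword_hits: dict[str, float],
                  alpha: float = 0.75) -> list[tuple[str, float]]:
    """Blend normalized vector-similarity and BM25 scores; alpha weights the vector side."""
    doc_ids = set(vector_hits) | set(keyword_hits)
    blended = {
        doc_id: alpha * vector_hits.get(doc_id, 0.0)
                + (1 - alpha) * keyword_hits.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# alpha=0.75 matches the 70-80% vector weighting most production systems converge on.
```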

Reranking adds a second-stage model that scores retrieved candidates for relevance. Initial vector search returns 50-100 candidates quickly; a cross-encoder model then evaluates each candidate against the original query. Cohere Rerank, BGE Reranker, and similar models improve precision by 10-30% compared to vector search alone. The additional latency (50-100ms for 50 candidates) trades against accuracy gains, making reranking appropriate for quality-sensitive applications.
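
A minimal reranking sketch using the sentence-transformers CrossEncoder class with a common public checkpoint (Cohere Rerank exposes a comparable hosted API):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, candidate) pair jointly, unlike bi-encoder retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    """Score retrieved candidates against the query and keep the best top_n."""
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```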

GraphRAG augments vector retrieval with knowledge graph relationships. Microsoft's GraphRAG implementation builds entity graphs during ingestion, then traverses relationships during retrieval to surface indirectly related content.²⁰ A query about a company's leadership can retrieve information about board members through relationship edges even without explicit mentions. GraphRAG achieves up to 99% search precision for structured domains but requires significant infrastructure for graph construction and maintenance.

Agentic RAG wraps retrieval in an autonomous agent loop that can reformulate queries, filter results, and iterate until finding satisfactory answers. Rather than single-shot retrieval, agentic systems execute multi-step reasoning—decomposing complex queries into sub-queries, synthesizing results from multiple retrievals, and verifying answers against source documents. Production agentic RAG requires robust error handling and cost controls to prevent runaway query chains.

Self-RAG and Corrective RAG implement self-correction mechanisms that verify retrieved content before generation. The system evaluates whether retrieved documents actually answer the query, potentially triggering additional retrieval rounds or acknowledging when the knowledge base lacks relevant information.²¹ These architectures reduce hallucination rates but add latency and complexity.

Operational best practices for production deployments

Maintaining production RAG systems requires observability, monitoring, and operational procedures distinct from traditional applications.

Retrieval quality monitoring tracks whether the system returns relevant documents for user queries. Production systems log query embeddings, retrieved document IDs, and user feedback signals (clicks, thumbs up/down, reformulated queries). Declining retrieval quality often indicates embedding drift, corpus staleness, or distribution shift in query patterns. Weekly quality reviews comparing current retrieval against baseline queries catch degradation before users notice.

Latency monitoring instruments each pipeline stage—query embedding, vector search, reranking, and generation. P50, P95, and P99 latency metrics by stage identify bottlenecks. Vector database query latency increasing over time indicates index fragmentation or capacity constraints. Embedding latency spikes suggest API throttling or model degradation.
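
A sketch of per-stage timing collection, assuming in-memory sample lists; a production system would export these samples to Prometheus or a similar metrics backend:

```python
import time
from collections import defaultdict

import numpy as np

stage_timings: dict[str, list[float]] = defaultdict(list)

def timed(stage: str):
    """Decorator that records wall-clock latency (ms) for each pipeline stage."""
    def wrapper(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                stage_timings[stage].append((time.perf_counter() - start) * 1000)
        return inner
    return wrapper

def report_latency():
    for stage, samples in stage_timings.items():
        p50, p95, p99 = np.percentile(samples, [50, 95, 99])
        print(f"{stage}: p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```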

Cost tracking allocates spending to embedding generation, vector storage, inference, and orchestration. Embedding costs often surprise teams, as high-traffic applications can generate millions of query embeddings monthly. Caching strategies reduce embedding costs by 40-60% for applications with query repetition. Storage costs grow with retention policies—archiving old document versions to cold storage reduces expenses.

Document freshness management ensures the knowledge base reflects current information. Stale documents generate incorrect answers, frustrating users and damaging trust. Automated freshness checks flag documents exceeding age thresholds or detect when source systems update content. Re-ingestion pipelines process updated documents within hours of changes.

Failover and recovery procedures handle component failures gracefully. Vector database outages should trigger fallback to keyword search rather than complete service failure. Embedding API outages can use cached embeddings for known queries while queuing new queries for retry. Regular disaster recovery testing validates that backup indices can serve production traffic.
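
A hedged sketch of the degrade-gracefully pattern, with vector_search and keyword_search as hypothetical helpers; the shape of the logic matters more than the specific clients:

```python
import logging

logger = logging.getLogger("rag.retrieval")

def retrieve_with_fallback(query: str, query_vector, vector_search, keyword_search):
    """Prefer vector search; degrade to keyword search instead of failing the request."""
    try:
        return vector_search(query_vector, top_k=20)
    except Exception as exc:                       # e.g., vector DB timeout or outage
        logger.warning("Vector search failed, falling back to keyword search: %s", exc)
        return keyword_search(query, top_k=20)
```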

Organizations deploying RAG across multiple applications benefit from centralized platform teams managing shared infrastructure. Introl's field engineers have deployed RAG infrastructure across our global coverage area, helping organizations scale from proof-of-concept to production-grade systems serving millions of queries daily.

Making RAG infrastructure decisions

RAG infrastructure decisions involve tradeoffs between cost, complexity, latency, and retrieval quality. The optimal architecture depends on organizational constraints and application requirements.

Start with managed services for initial deployments. Pinecone or Weaviate Cloud eliminates operational complexity, letting teams focus on retrieval quality and application integration. OpenAI or Voyage AI embeddings provide reliable quality without GPU management. LangChain or LlamaIndex handles orchestration with minimal custom code.

Graduate to hybrid approaches as scale increases. Self-hosted embedding models reduce per-query costs for high-volume applications. Self-managed vector databases offer control over performance optimization and data residency. Custom orchestration enables application-specific optimizations.

Optimize chunking and embedding strategies before optimizing infrastructure. Poor chunking undermines expensive vector databases; weak embedding models waste fast hardware. Systematic evaluation against domain-specific benchmarks identifies the highest-impact improvements.

Plan for evolution from the beginning. Lock in flexible contracts with vector database providers. Design abstraction layers that isolate application code from infrastructure choices. Track embedding model versions to enable migration when better models emerge.

The RAG infrastructure landscape continues evolving rapidly. GraphRAG, agentic architectures, and improved embedding models will reshape best practices over the coming year. Organizations building production systems today should design for adaptability while optimizing for current requirements. The investment in solid RAG infrastructure pays dividends through reduced hallucination, improved user trust, and AI applications that genuinely understand organizational knowledge.

References

  1. Harvey AI. "Enterprise-Grade RAG Systems." Harvey AI Blog, 2025. https://www.harvey.ai/blog/enterprise-grade-rag-systems

  2. Zilliz. "Infrastructure Challenges in Scaling RAG with Custom AI Models." Zilliz Blog, 2024. https://zilliz.com/blog/infrastructure-challenges-in-scaling-rag-with-custom-ai-models

  3. Multimodal.dev. "Semantic Chunking for RAG: Better Context, Better Results." Multimodal.dev, 2024. https://www.multimodal.dev/post/semantic-chunking-for-rag

  4. Weaviate. "Chunking Strategies for RAG." Weaviate Blog, 2024. https://weaviate.io/blog/chunking-strategies-for-rag

  5. Elephas. "13 Best Embedding Models in 2025: OpenAI vs Voyage AI vs Ollama." Elephas Blog, 2025. https://elephas.app/blog/best-embedding-models

  6. LiquidMetal AI. "Vector Database Comparison: Pinecone vs Weaviate vs Qdrant vs FAISS vs Milvus vs Chroma (2025)." LiquidMetal AI, 2025. https://liquidmetal.ai/casesAndBlogs/vector-comparison/

  7. Pinecone. "Pinecone Documentation." Pinecone, 2025. https://docs.pinecone.io

  8. Weaviate. "Weaviate Documentation." Weaviate, 2025. https://weaviate.io/developers/weaviate

  9. Milvus. "How do I choose between Pinecone, Weaviate, Milvus, and other vector databases?" Milvus Documentation, 2025. https://milvus.io/ai-quick-reference/how-do-i-choose-between-pinecone-weaviate-milvus-and-other-vector-databases

  10. Qdrant. "Qdrant Documentation." Qdrant, 2025. https://qdrant.tech/documentation/

  11. Voyage AI. "voyage-3-large: the new state-of-the-art general-purpose embedding model." Voyage AI Blog, January 2025. https://blog.voyageai.com/2025/01/07/voyage-3-large/

  12. Cohere. "Cohere Embed v4." Cohere Documentation, 2025. https://docs.cohere.com/docs/embed-v4

  13. Firecrawl. "Best Chunking Strategies for RAG in 2025." Firecrawl Blog, 2025. https://www.firecrawl.dev/blog/best-chunking-strategies-rag-2025

  14. Databricks. "Mastering Chunking Strategies for RAG: Best Practices & Code Examples." Databricks Community, 2024. https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089

  15. NVIDIA. "Scaling Enterprise RAG with Accelerated Ethernet Networking and Networked Storage." NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/scaling-enterprise-rag-with-accelerated-ethernet-networking-and-networked-storage/

  16. Cohesity. "Scaling RAG systems for the enterprise." Cohesity Blog, 2024. https://www.cohesity.com/blogs/scaling-retrieval-augmented-generation-systems-for-enterprises/

  17. Equinix. "What Is Retrieval-Augmented Generation (RAG), and Where Should You Do It?" Equinix Interconnections Blog, April 2025. https://blog.equinix.com/blog/2025/04/30/what-is-retrieval-augmented-generation-rag-and-where-should-you-do-it/

  18. NexGenCloud. "Enterprise RAG at Scale: Why Businesses Can't Afford to Stay Small." NexGenCloud Blog, 2025. https://www.nexgencloud.com/blog/thought-leadership/enterprise-rag-at-scale-why-businesses-can-t-afford-to-stay-small

  19. Weaviate. "Hybrid Search." Weaviate Documentation, 2025. https://weaviate.io/developers/weaviate/search/hybrid

  20. Microsoft. "GraphRAG: Unlocking LLM discovery on narrative private data." Microsoft Research Blog, 2024. https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

  21. Squirro. "RAG in 2025: Bridging Knowledge and Generative AI." Squirro Blog, 2025. https://squirro.com/squirro-blog/state-of-rag-genai


Key takeaways

For ML engineers:
- Harvey AI serves 97% of the Am Law 100 with RAG; GraphRAG achieves up to 99% search precision for structured domains
- Voyage-3-large outperforms OpenAI embeddings by 9.74% and Cohere by 20.71%, and supports a 32K-token context vs. 8K for OpenAI
- Semantic chunking improves recall by up to 9% over fixed-size; reranking adds 10-30% precision at a 50-100ms latency cost

For infrastructure architects:
- Vector DB selection: Pinecone (zero-ops managed), Weaviate (hybrid search + graphs), Milvus (billion-scale, sub-10ms), Qdrant (complex filtering)
- A 10M-document corpus with 1024-dimensional embeddings requires ~40GB of vector storage; Milvus leads benchmarks on billion-vector indices
- Production latency targets: sub-100ms interactive, sub-50ms real-time; Redis/Memcached caching eliminates redundant vector searches

For finance teams:
- Embedding costs: Voyage AI $0.06/M tokens, OpenAI $0.13/M, Cohere varies; embedding 10M docs at 500 tokens each costs $300-650
- Caching strategies reduce embedding costs 40-60%; quantization reduces storage 4-8x with minimal accuracy loss
- Cost optimization is critical at scale: embedding drift requires re-embedding the entire corpus when models change

For operations teams:
- Chunking strategy is semi-permanent (changing it requires re-embedding the entire corpus); evaluate against domain-specific benchmarks first
- Embedding drift degrades quality over time; track model versions per document for migration and rollback capability
- Multi-tenancy requires metadata filtering or separate indices per tenant; weekly quality reviews catch degradation before users notice


