RAG Infrastructure: Building Production Retrieval-Augmented Generation Systems


Blake Crosley

Mar 25, 2026, 10 min read


Updated December 8, 2025

December 2025 update: RAG adoption is accelerating as the #1 enterprise LLM use case. GraphRAG and agentic RAG architectures are gaining popularity for complex reasoning. The vector database market is consolidating around Pinecone, Weaviate, Milvus, and Qdrant. Voyage-3-large outperforms OpenAI and Cohere embeddings by 9-20%. Semantic chunking improves recall by up to 9% over fixed-size approaches. Production challenges are shifting from prototype to scale: embedding drift, multi-tenancy, and sub-50ms latency requirements are driving infrastructure investment.

Harvey AI serves 97% of Am Law 100 law firms, using retrieval-augmented generation to ground legal research in actual case law rather than hallucinated citations.¹ Anthropic, OpenAI, and Google all recommend RAG as the primary technique for connecting large language models to proprietary enterprise data. Yet the gap between a working RAG prototype and production-grade infrastructure spans months of engineering effort. Organizations discover that vector databases, embedding pipelines, chunking strategies, and retrieval optimization each present distinct infrastructure challenges that compound at scale. Building a RAG system that handles millions of documents, serves thousands of concurrent users, and maintains sub-second latency requires architectural decisions that few teams anticipate during proof-of-concept stages.

The core architecture every production RAG system needs

RAG systems combine two fundamental capabilities: retrieving relevant context from a knowledge base and generating responses grounded in that context. The architecture breaks into five distinct components, each with specific infrastructure requirements.

Document ingestion pipelines handle the flow from raw documents to searchable embeddings. Production systems process PDFs, HTML, Word documents, Slack messages, and database records through format-specific parsers. Ingestion pipelines must track document versions, handle incremental updates, and maintain metadata for filtering. Typical enterprise deployments process 100,000 to 10 million documents during the initial backfill, followed by incremental loads of 1,000 to 50,000 new documents per day.²
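One common way to handle version tracking and incremental updates is to store a content hash per document and re-embed only what changed. The sketch below is illustrative rather than a reference design: `IngestionIndex` and `plan` are hypothetical names, and a real pipeline would persist the hashes in a database instead of an in-memory dict.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IngestionIndex:
    """Remembers the last content hash seen per document ID (hypothetical helper)."""

    hashes: dict = field(default_factory=dict)

    def plan(self, documents: dict) -> dict:
        """Classify a batch of {doc_id: text} as new, changed, or unchanged."""
        result = {"new": [], "changed": [], "unchanged": []}
        for doc_id, text in documents.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            previous = self.hashes.get(doc_id)
            if previous is None:
                result["new"].append(doc_id)
            elif previous != digest:
                result["changed"].append(doc_id)
            else:
                result["unchanged"].append(doc_id)
            # Record the latest version so the next batch compares against it.
            self.hashes[doc_id] = digest
        return result


index = IngestionIndex()
index.plan({"a": "hello", "b": "world"})           # initial backfill
delta = index.plan({"a": "hello", "b": "world!"})  # incremental load
print(delta)
```

Only the IDs under `new` and `changed` would need parsing, chunking, and embedding again; `unchanged` documents can be skipped.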

Chunking systems split documents into retrieval-friendly segments. Fixed-size chunking works for homogeneous content such as news articles, while semantic chunking preserves meaning boundaries in complex documents.³ Most production systems use recursive chunking with 400-512 tokens and 10-20% overlap, achieving 85-90% recall in benchmark tests.⁴ The choice of chunking strategy becomes semi-permanent: changing approaches later requires re-embedding the entire corpus.

Embedding infrastructure converts text chunks into dense vector representations. Organizations choose between managed APIs (OpenAI, Cohere, Voyage AI) and self-hosted models. Embedding generation creates the most variable cost structure in RAG systems, with pricing ranging from $0.02 to $0.18 per million tokens depending on model selection.⁵ Batch processing parallelizes embedding generation across GPU nodes for initial loads, while streaming pipelines handle incremental updates.

Vector databases store and retrieve embeddings using approximate nearest neighbor algorithms. The four major options (Pinecone, Weaviate, Milvus, and Qdrant) serve different operational profiles. Pinecone offers a zero-ops managed service, Weaviate provides hybrid search with knowledge graph capabilities, Milvus handles billion-scale deployments, and Qdrant excels at complex metadata filtering.⁶ Storage requirements scale with embedding dimension and document count; a 10-million document corpus with 1024-dimensional embeddings requires approximately 40GB of vector storage, a figure consistent with 10 million vectors at 4 bytes per dimension before any index overhead.
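The storage figure above can be reproduced with simple arithmetic. This sketch assumes 32-bit floats and one vector per document, neither of which the text states explicitly; chunked corpora produce several vectors per document, and index structures add overhead on top of the raw vectors.

```python
def vector_storage_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes, excluding index and metadata overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9


# 10 million vectors at 1024 dimensions, float32 (assumed).
baseline = vector_storage_gb(10_000_000, 1024)

# The same corpus if each document were split into 8 chunks (hypothetical ratio).
chunked = vector_storage_gb(10_000_000 * 8, 1024)

print(f"{baseline:.2f} GB raw, {chunked:.2f} GB with 8 chunks per document")
```

Running the same function with 3072 dimensions shows why embedding dimension is as much a storage decision as a quality decision.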

Retrieval and generation orchestration ties the components together, typically using frameworks such as LangChain or LlamaIndex, or custom implementations. Orchestration handles query processing, retrieval, reranking, prompt construction, and response generation. Production systems implement caching layers, fallback strategies, and observability instrumentation at each stage.
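The orchestration stages can be sketched as a single function that takes each stage as a callable. This is a minimal, framework-free illustration, not how LangChain or LlamaIndex structure it: `answer_query`, `build_prompt`, and the prompt template are hypothetical, and the `retrieve`, `rerank`, and `generate` callables stand in for a vector database client, a reranker, and an LLM client.

```python
def build_prompt(query, passages):
    """Assemble a grounded prompt; the template wording is illustrative only."""
    context = "\n".join(f"- {p}" for p in passages) or "(no context retrieved)"
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


def answer_query(query, retrieve, rerank, generate, cache, top_k=3):
    """Run cache lookup, retrieval, reranking, prompt construction, and generation."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key in cache:
        return cache[key]
    try:
        candidates = retrieve(query)
        retrieval_ok = True
    except Exception:
        # Fallback strategy: degrade to an ungrounded answer instead of failing.
        candidates = []
        retrieval_ok = False
    passages = rerank(query, candidates)[:top_k]
    response = generate(build_prompt(query, passages))
    if retrieval_ok:
        # Cache only grounded answers so an outage does not poison the cache.
        cache[key] = response
    return response
```

Passing the stages in as callables keeps each one easy to swap or instrument, which is where per-stage latency metrics and tracing would attach in a real deployment.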

Vector database selection determines operational complexity

By December 2025, the vector database market had consolidated around four major players, each serving different operational profiles and use cases.

Pinecone dominates the managed-service segment, handling infrastructure entirely behind its API. Teams deploy production systems in hours rather than weeks, with automatic scaling, multi-region replication, and SOC 2 compliance included. Pinecone supports up to 40KB of metadata per vector, enabling rich filtering without external systems. The tradeoff is higher per-query costs and reduced control over infrastructure optimization. Organizations running predictable workloads often find Pinecone cost-effective; those with highly variable traffic or extreme scale requirements typically migrate to alternatives.⁷

Weaviate bridges open-source flexibility and managed convenience through Weaviate Cloud. The system combines vector search with knowledge graph capabilities, enabling hybrid queries that filter on structured data while ranking by semantic similarity. Weaviate's modular architecture supports multiple embedding models simultaneously, which is useful for organizations experimenting with different approaches. Docker and Kubernetes deployments require modest operational expertise, making Weaviate popular with teams that have some infrastructure capability.⁸

Milvus (and its managed counterpart, Zilliz Cloud) targets billion-scale deployments with performance as the primary design goal. Milvus leads benchmarks on raw latency, achieving sub-10ms query times on billion-vector indices through GPU acceleration and advanced indexing algorithms.⁹ The architecture separates compute and storage, enabling each layer to scale independently. Operating Milvus requires significant data engineering expertise: teams without dedicated infrastructure personnel often struggle with cluster management and performance tuning.

Qdrant has gained rapid adoption for complex filtering requirements. Built in Rust, Qdrant executes payload filtering directly within the search algorithm rather than as post-processing, delivering superior performance for filtered queries.¹⁰ Its compact resource footprint makes Qdrant popular for cost-sensitive deployments, while its crisp API design accelerates development velocity. Self-hosted deployments run smoothly on modest infrastructure, though enterprise features require commercial licensing.
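The difference between filtering inside the search and filtering as post-processing shows up even in a toy brute-force search. The sketch below does not use Qdrant or reproduce its index-level algorithm; it only illustrates why post-filtering a top-k list can return too few results, using hypothetical points and a `tenant` payload field.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def post_filter_search(query, points, predicate, k):
    """Rank everything, keep the top k, then drop points that fail the filter."""
    ranked = sorted(points, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:k] if predicate(p["payload"])]


def filtered_search(query, points, predicate, k):
    """Apply the filter while selecting candidates, then keep the top k."""
    eligible = [p for p in points if predicate(p["payload"])]
    ranked = sorted(eligible, key=lambda p: cosine(query, p["vector"]), reverse=True)
    return [p["id"] for p in ranked[:k]]


points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"tenant": "a"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"tenant": "a"}},
    {"id": 3, "vector": [0.8, 0.2], "payload": {"tenant": "b"}},
    {"id": 4, "vector": [0.1, 0.9], "payload": {"tenant": "b"}},
]
only_tenant_b = lambda payload: payload["tenant"] == "b"

print(post_filter_search([1.0, 0.0], points, only_tenant_b, k=2))
print(filtered_search([1.0, 0.0], points, only_tenant_b, k=2))
```

Here the two nearest points both belong to tenant `a`, so the post-filtered query comes back empty while the filtered search still returns two results for tenant `b`.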

Selection criteria should prioritize operational capability first. Teams that need zero-ops choose Pinecone or Weaviate Cloud. Organizations with SRE capacity that is comfortable running stateful Kubernetes workloads gain cost savings and control from self-hosted Milvus, Qdrant, or Weaviate. Compliance requirements sometimes eliminate options: Pinecone and Weaviate Cloud offer SOC 2 and HIPAA compliance, while on-premise mandates require self-hosted solutions.

Embedding model selection affects both cost and retrieval quality

Embedding models convert text into vector representations, and model selection directly impacts retrieval accuracy. The December 2025 landscape offers three leading commercial options plus several strong open-source alternatives.

Voyage AI leads MTEB benchmarks, with voyage-3-large outperforming OpenAI text-embedding-3-large by 9.74% and Cohere embed-v3-english by 20.71% across evaluated domains.¹¹ Voyage AI supports 32K-token context windows (compared to 8K for OpenAI and 512 for older Cohere models), enabling longer documents to be processed without chunking. Its 1024-dimensional embeddings cost $0.06 per million tokens, 2.2x cheaper than OpenAI and 1.6x cheaper than Cohere, while requiring 3x less vector storage than OpenAI's 3072-dimensional embeddings.
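The pricing gap is easiest to judge against a concrete corpus. The sketch below uses the per-million-token prices quoted in this article for voyage-3-large and text-embedding-3-large; the corpus size of 10 million documents at 1,000 tokens each is an assumption chosen purely for illustration.

```python
def embedding_cost_usd(total_tokens: int, price_per_million_tokens: float) -> float:
    """One-time cost of embedding a corpus at a flat per-token price."""
    return total_tokens / 1_000_000 * price_per_million_tokens


# Prices per million tokens as quoted in the text above.
PRICES = {"voyage-3-large": 0.06, "text-embedding-3-large": 0.13}

# Hypothetical corpus: 10 million documents averaging 1,000 tokens each.
corpus_tokens = 10_000_000 * 1_000

costs = {model: embedding_cost_usd(corpus_tokens, price) for model, price in PRICES.items()}
for model, cost in costs.items():
    print(f"{model}: ${cost:,.0f}")
```

Under these assumptions the backfill costs roughly $600 versus $1,300. The same function applies to every re-embedding of the corpus, which is why a later model switch is not free.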

OpenAI text-embedding-3-large offers the most battle-tested option for production deployments. The model supports configurable output dimensions from 256 to 3072, enabling cost-storage tradeoffs. At $0.13 per million tokens, OpenAI sits in the middle of the pricing spectrum while providing reliable uptime and extensive documentation. Organizations already using OpenAI's inference APIs often standardize on its embeddings for operational simplicity.
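Configurable dimensions are normally requested from the API itself, but the underlying idea can be sketched locally: keep the leading components of a vector and rescale it to unit length so cosine and dot-product scores stay comparable. The helper below is a hypothetical illustration with no API call; truncation only preserves quality for models trained to support it, so treat this as a sketch of the mechanism rather than a general-purpose compression step.

```python
import math


def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` components and re-normalize to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = list(vector[:dimensions])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]            # stand-in for a much longer embedding
short = shorten_embedding(full, 2)  # keeps two of the three components
print(short)
```

Storage falls in proportion to the number of dimensions kept, so going from 3072 to 1024 dimensions cuts raw vector storage to one third.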

Cohere embed-v4 achieved the highest MTEB score (65.2) as of November 2025 and is optimized specifically for search and retrieval rather than general-purpose embedding.¹² Cohere embeddings pair naturally with Cohere's reranker for two-stage retrieval pipelines. The model excels in multilingual applications, supporting more than 100 languages with strong cross-lingual retrieval.
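A two-stage retrieval pipeline first narrows the corpus with a cheap scorer, then reorders the shortlist with a more precise one. The sketch below shows only that control flow: `keyword_overlap` and `jaccard` are simple stand-ins for an embedding search and a reranker model, and no Cohere API is called.

```python
def keyword_overlap(query, document):
    """Cheap first-stage score: number of shared words (stand-in for vector search)."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def jaccard(query, document):
    """Second-stage score: Jaccard similarity (stand-in for a reranker model)."""
    q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def two_stage_retrieve(query, documents, cheap_score, precise_score, shortlist_size=20, top_k=3):
    """Shortlist with the cheap scorer, then reorder the shortlist with the precise one."""
    shortlist = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:shortlist_size]
    return sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)[:top_k]


docs = [
    "vector database filtering",
    "vector database",
    "cooking pasta recipes",
    "database filtering with vector payload indexes and more",
]
results = two_stage_retrieve(
    "vector database filtering", docs, keyword_overlap, jaccard, shortlist_size=3, top_k=2
)
print(results)
```

The long fourth document ties for first place on raw overlap but falls behind the shorter "vector database" entry once the second stage penalizes its unrelated words, which is the kind of reordering a reranker is meant to provide.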

Open-source alternatives, including the BGE, E5, and GTE models, enable self-hosted embedding at scale. Organizations processing billions of documents often deploy these models on internal GPU infrastructure to eliminate per-token costs. Self-hosting requires managing model updates, capacity planning, and inference optimization, tradeoffs that only make sense at significant scale.

The embedding model decision cascades through the entire system. Changing models later requires re-embedding the complete document corpus, a process that costs time, compute, and potentially service disruption. Production systems should evaluate models against domain-specific benchmarks rather than relying on generic MTEB scores. A model that excels at general knowledge may underperform on legal, medical, or financial text.
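A domain-specific evaluation can start with a small set of labeled queries and a recall@k metric computed per candidate model. The sketch below assumes a hypothetical labeled set in which each query has a known set of relevant document IDs; the model names and results are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Hypothetical results for two models on the same two labeled legal queries.
model_a_runs = [(["d1", "d7", "d3"], {"d1", "d3"}), (["d9", "d2", "d4"], {"d2"})]
model_b_runs = [(["d7", "d8", "d1"], {"d1", "d3"}), (["d4", "d9", "d5"], {"d2"})]

print("model A recall@3:", mean_recall_at_k(model_a_runs, 3))
print("model B recall@3:", mean_recall_at_k(model_b_runs, 3))
```

Running this against a few hundred queries drawn from the actual corpus gives a more relevant signal than a public leaderboard position.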

Chunking strategies determine retrieval precision

Document chunking creates the atomic units that the retrieval system searches. The choice of chunking strategy is one of the most consequential infrastructure decisions, with a potential 9% recall variation between the best and worst approaches.¹³

Fixed-size chunking splits documents at predetermined token counts regardless of content structure. The approach works well for homogeneous corpora: news articles, product descriptions, or standardized documents. Implementation requires minimal complexity, which makes fixed-size chunking the natural starting point for prototypes. Most production systems use 400-512 token chunks with 50-100 token overlaps, balancing retrieval granularity against context preservation.
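Fixed-size chunking with overlap amounts to a sliding window over the token sequence. The sketch below works on any list of tokens; how text is tokenized is left to the caller, and the default sizes are picked from the ranges mentioned above rather than from any specific system.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Whitespace splitting is a crude stand-in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary is still partly visible from both sides.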

Semantic chunking divides documents at meaningful boundaries (paragraph breaks, section headers, or thematic shifts), preserving coherent ideas within each chunk. The implementation uses sentence embeddings to detect semantic boundaries, splitting when the similarity between adjacent sentences drops below a threshold. Semantic chunking improves recall by up to 9% for narrative content such as documentation, FAQs, and conversational data.¹⁴ The approach requires more compute during ingestion and careful tuning of similarity thresholds.
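The threshold rule described above can be sketched in a few lines. To stay self-contained, this example substitutes a bag-of-words vector for a real sentence embedding model; `embed` is a stand-in, and the threshold of 0.2 is an arbitrary value for this toy input, not a recommended setting.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding: a bag-of-words count vector (a real system would use a model)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, threshold=0.2, embed_fn=embed):
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed_fn(sentences[0])
    for sentence in sentences[1:]:
        current = embed_fn(sentence)
        if cosine(previous, current) < threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
        previous = current
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "The index stores vectors.",
    "The index returns nearest vectors.",
    "Billing runs monthly.",
    "Billing invoices are monthly.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

The split lands between the second and third sentences, where the topic changes from indexing to billing. With a real embedding model, the threshold would need tuning per corpus, as the paragraph above notes.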

Recursive chunking applies hierarchical splitting rules, attempting large splits first (section breaks), then progressively smaller ones (paragraph breaks, sentence breaks) until chunks reach the target size. LangChain's RecursiveCharacterTextSplitter implements this pattern, achieving strong performance across diverse document types without per-corpus tuning. Recursive chunking balances implementation simplicity with retrieval quality, making it the default recommendation for new systems.
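The recursive pattern can be written without any framework. The function below is a simplified, dependency-free sketch of the idea and is not LangChain's implementation: it measures size in characters rather than tokens, has no overlap, and the separator list is an assumption.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into the current chunk
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(part) > max_chars:
                # This piece is still too large: retry with finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard character split.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


sample = (
    "Section one. It has two sentences.\n\n"
    "Section two is a single longer sentence that will need splitting on spaces."
)
for chunk in recursive_split(sample, max_chars=40):
    print(repr(chunk))
```

The first section fits in one chunk and is kept whole, while the second has no finer boundary than spaces and is packed word by word up to the size limit.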

Page-level chunking emerged from NVIDIA benchmarks showing 0.648 accuracy with the lowest variance across document types.¹⁵ For structured documents such as reports and papers, treating each page as a chunk preserves spatial relationships and cross-references. Page-level approaches work poorly for documents that lack clear page boundaries (HTML, chat logs, code) but excel on PDF-heavy corpora.

Hierarchical chunking builds multi-level indexes with nested granularity: section, subsection, paragraph, and sentence levels. Retrieval first identifies relevant sections, then drills into specific p

[Content truncated for translation]
