Embedding Infrastructure at Scale: Vector Generation for Production AI
Updated December 11, 2025
December 2025 Update: Billion-item embedding collections requiring 5.8+ days on a single L4 GPU (2,000 tokens/sec). API embedding costs ranging from $0.02 to $0.18 per million tokens. 1B 1024-dim vectors requiring ~4TB of storage before indexing. Production RAG applications demanding millisecond similarity search across billions of vectors. Distributed GPU clusters and aggressive caching separating prototypes from production systems.
A single NVIDIA L4 GPU processes approximately 2,000 text tokens per second through a 7-billion parameter embedding model. At that rate, generating embeddings for a billion-item collection requires over 5.8 days on one machine.¹ The falcon-refinedweb dataset with 600 billion tokens would take more than 9.5 years. Embedding infrastructure at scale demands distributed systems, aggressive optimization, and strategic caching—capabilities that separate prototype RAG applications from production-ready knowledge systems.
Embeddings power modern AI applications: semantic search, retrieval-augmented generation, recommendation systems, and similarity matching. Yet organizations consistently underestimate the infrastructure required to generate, store, and serve embeddings at enterprise scale. What begins as a prototype with thousands of embeddings can balloon into a multi-million dollar infrastructure challenge as data grows to billions of vectors.²
The embedding infrastructure challenge
Scale dimensions
Embedding infrastructure must handle three distinct scaling challenges:
Generation throughput: Converting raw text, images, or other content into vector representations. Batch processing billions of documents requires distributed GPU clusters and optimized pipelines.
Storage capacity: High-dimensional vectors consume substantial space. A billion 1024-dimensional float32 vectors requires approximately 4 terabytes before indexing overhead.
Query latency: Production applications demand millisecond-level similarity search across billions of vectors, requiring specialized indexing and caching infrastructure.
Cost dynamics
Engineering teams routinely discover that embeddings quietly devour infrastructure budgets:³
Compute costs: Embedding generation requires GPU acceleration. API-based embedding costs $0.02-0.18 per million tokens depending on provider and model quality.
Storage costs: Vector databases charge per vector stored and indexed. Costs scale linearly with data volume—doubling vectors doubles storage expense.
Query costs: Similarity search across large collections requires computational resources that increase with collection size and query volume.
A production RAG system processing 10 million documents with 100,000 daily queries might cost $50-100 per day in embedding operations alone—$1,500-3,000 monthly before other infrastructure costs.
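The arithmetic behind such an estimate is simple enough to sketch. The function below is a back-of-envelope estimator, not a pricing tool; the churn rate, token counts, and per-million price are illustrative assumptions to replace with your own measurements.

```python
# Back-of-envelope daily embedding spend for an API-priced model.
# All inputs are illustrative assumptions -- substitute measured values.
def daily_embedding_cost(
    docs_per_day: int,
    tokens_per_doc: int,
    queries_per_day: int,
    tokens_per_query: int,
    price_per_million: float,
) -> float:
    total_tokens = docs_per_day * tokens_per_doc + queries_per_day * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million

# Example: re-embedding 5% of a 10M-document corpus daily plus 100K queries
print(daily_embedding_cost(
    docs_per_day=500_000,     # assumed daily document churn
    tokens_per_doc=800,       # assumed average document length
    queries_per_day=100_000,
    tokens_per_query=30,
    price_per_million=0.13,   # OpenAI text-embedding-3-large list price
))  # ~$52/day under these particular assumptions
```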
Embedding model selection
Provider comparison
OpenAI text-embedding-3:⁴
- Dimensions: 3072 (large), 1536 (small)
- Context window: 8,192 tokens
- Pricing: $0.13/M tokens (large), $0.02/M tokens (small)
- Strengths: Proven reliability, extensive documentation
- Considerations: Higher dimension increases storage costs

Voyage AI voyage-3:⁵
- Dimensions: 1024
- Context window: 32,000 tokens
- Pricing: $0.06/M tokens
- Strengths: Outperforms OpenAI by 9.74% average across domains; 3-4x smaller dimension reduces storage costs
- Considerations: Newer provider, smaller ecosystem

Cohere embed-v4:
- Dimensions: 1024
- Context window: 512 tokens (limited)
- Pricing: Competitive with OpenAI
- Strengths: Excellent multilingual support, low latency
- Considerations: Short context window limits document handling

Google Gemini embedding:
- Dimensions: 768
- Context window: 2,048 tokens
- Pricing: Free tier available
- Strengths: Cost-effective, good quality
- Considerations: Rate limits on free tier
Open-source alternatives
Self-hosted models eliminate per-token costs at the expense of infrastructure management:⁶
E5-Large-V2:
- Dimensions: 1024
- Performance: Strong MTEB/BEIR benchmark scores
- Best for: General-purpose retrieval
- Infrastructure: Runs efficiently on consumer GPUs

BGE-Large:
- Dimensions: 1024
- Performance: Competitive with commercial APIs
- Best for: Cost-sensitive deployments
- Infrastructure: Well-optimized inference

Mistral-embed:
- Dimensions: 1024
- Performance: 77.8% accuracy in benchmarks (highest tested)
- Best for: Maximum retrieval accuracy
- Infrastructure: Requires more GPU memory

GTE-Qwen2-7B:
- Dimensions: 4096
- Performance: State-of-the-art quality
- Best for: Quality-critical applications
- Infrastructure: Requires A100/H100-class GPUs
Selection criteria
| Factor | API Models | Self-Hosted |
|---|---|---|
| Setup complexity | Low | High |
| Per-token cost | $0.02-0.18/M | ~$0 (after infra) |
| Throughput control | Rate-limited | Unlimited |
| Data privacy | External processing | Full control |
| Model updates | Automatic | Manual |
| Fine-tuning | Limited | Full flexibility |
Choose API when: Volume under 100M tokens/month, team lacks ML infrastructure expertise, rapid deployment matters more than cost optimization.
Choose self-hosted when: Volume exceeds 100M tokens/month, data privacy requirements prevent external processing, need custom fine-tuning for domain-specific vocabulary.
Batch processing architecture
Distributed embedding pipelines
Large-scale embedding generation requires distributed processing across multiple GPUs:⁷
SkyPilot approach: By leveraging resources across cloud regions, organizations access hundreds of GPUs simultaneously. One documented deployment used 406 L4 GPUs to achieve 364,400 tokens per second throughput, reducing processing time from 20 hours to 2.3 hours (9x faster).
Pipeline architecture:
┌─────────────────┐
│ Data Source │
│ (S3/GCS/etc) │
└────────┬────────┘
│
┌────────▼────────┐
│ Coordinator │
│ (Job Scheduler)│
└────────┬────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Worker 1│ │ Worker 2 │ │ Worker N │
│ (GPU) │ │ (GPU) │ │ (GPU) │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌────────▼────────┐
│ Vector Store │
│ (Milvus/etc) │
└─────────────────┘
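The coordinator's job reduces to sharding the corpus and dispatching shards to workers. The minimal sketch below illustrates that pattern with a local process pool standing in for a real scheduler such as SkyPilot or Ray; `embed_shard` is a placeholder for the worker-side model loading, batching, and vector-store writes.

```python
# Minimal sketch of the coordinator's sharding logic. In production the
# "submit" step would hand shards to a real scheduler; here a process pool
# stands in for the GPU workers.
from concurrent.futures import ProcessPoolExecutor

def make_shards(doc_ids: list[str], num_workers: int) -> list[list[str]]:
    # Round-robin assignment keeps shard sizes within one document of each other
    return [doc_ids[i::num_workers] for i in range(num_workers)]

def embed_shard(shard: list[str]) -> int:
    # Placeholder: load the model on this worker's GPU, fetch documents,
    # embed in batches, write vectors to the vector store, return count
    return len(shard)

if __name__ == "__main__":
    doc_ids = [f"doc-{i}" for i in range(1_000_000)]
    shards = make_shards(doc_ids, num_workers=8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        completed = sum(pool.map(embed_shard, shards))
    print(f"embedded {completed} documents")
```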
Throughput optimization
Batch size tuning:⁸ Optimal batch size varies significantly with sequence length. For a given GPU, optimal batch size ranges from over 10,000 for short sequences to approximately 500 for long documents. Incorrect batch sizing leaves GPU utilization below 50%.
Sequence sorting: Pre-sorting sentences by length minimizes padding within batches. Tokenizers pad sequences to the longest item in each batch—grouping similar-length inputs reduces wasted computation by 20-40%.
Mixed precision inference: FP16 inference cuts memory usage and speeds processing on GPUs with tensor cores. Most embedding quality degrades negligibly with reduced precision.
```python
# Optimized batch embedding (sentence-transformers style model)
def embed_documents_optimized(texts, model, batch_size=64):
    # Sort by length to minimize padding within each batch
    sorted_texts = sorted(enumerate(texts), key=lambda x: len(x[1]))
    embeddings = [None] * len(texts)

    for i in range(0, len(sorted_texts), batch_size):
        batch = sorted_texts[i:i + batch_size]
        indices, batch_texts = zip(*batch)

        # Generate embeddings with GPU tensors
        batch_embeddings = model.encode(
            list(batch_texts),
            convert_to_tensor=True,   # Keep results on GPU
            normalize_embeddings=True,
        )

        # Restore original document order
        for idx, emb in zip(indices, batch_embeddings):
            embeddings[idx] = emb

    return embeddings
```
Cost optimization
Spot instances:⁹ Using spot/preemptible instances reduces embedding generation costs by 61% (from $710 to $277 in one case study). Batch workloads tolerate interruption—checkpoint progress and resume on new instances.
Regional arbitrage: Distribute workloads across cloud regions based on GPU availability and pricing. SkyPilot and similar tools automate cross-region scheduling for cost optimization.
Model selection tradeoffs: Smaller models process faster at lower cost. MiniLM delivers 5-14k sentences/second on CPU versus 1-2k for larger models—5x throughput difference. Benchmark quality requirements against processing costs.
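Checkpointing for spot interruptions can be as simple as a manifest of completed shards. The sketch below assumes each shard's vectors are written to durable storage as the shard finishes; the manifest path and the `embed_and_upload` helper are hypothetical, and in practice the manifest itself belongs in object storage rather than on local disk.

```python
# Shard-level checkpointing for spot/preemptible workers: record each completed
# shard in a manifest so a replacement instance can skip finished work.
import json
from pathlib import Path

MANIFEST = Path("checkpoints/completed_shards.json")  # illustrative path

def embed_and_upload(shard_id: str) -> None:
    # Placeholder: load the shard, embed it in batches, write vectors to the store
    pass

def load_completed() -> set[str]:
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()

def mark_completed(shard_id: str, completed: set[str]) -> None:
    completed.add(shard_id)
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(sorted(completed)))

def run_job(all_shards: list[str]) -> None:
    completed = load_completed()
    for shard_id in all_shards:
        if shard_id in completed:
            continue                   # embedded before the interruption
        embed_and_upload(shard_id)
        mark_completed(shard_id, completed)
```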
Real-time embedding infrastructure
Query embedding architecture
Production RAG systems generate embeddings for user queries in real-time. Latency directly impacts user experience:¹⁰
Target latencies:
- Query embedding: 10-50ms
- Vector search: 10-100ms
- Total retrieval: 50-200ms
Architecture patterns:
User Query → Load Balancer → Embedding Service → Vector DB → Results
│
┌───────┴───────┐
│ GPU Pool │
│ (N replicas) │
└───────────────┘
Embedding service deployment
Containerized serving: Deploy embedding models as containerized microservices. Kubernetes handles scaling, load balancing, and health checking.
NVIDIA NIM:¹¹ NVIDIA provides pre-optimized inference microservices for embedding models. NIM containers deliver production-ready performance without custom optimization.
vLLM for embeddings: While designed for LLM inference, vLLM supports embedding model serving with optimizations like continuous batching and PagedAttention.
Baseten Performance Client:¹² Custom Rust-based client delivers up to 12x better throughput for batch embedding workloads compared to standard OpenAI SDK implementations.
Latency optimization
Connection pooling: Maintain persistent connections to embedding services. Connection establishment adds 10-50ms overhead per request.
Request batching: Batch multiple queries arriving within short windows. Micro-batching (5-10ms windows) improves throughput while maintaining acceptable latency.
GPU memory management: Keep models loaded in GPU memory. Cold starts add seconds of latency for model loading.
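As an illustration of the request-batching pattern above, here is a minimal asyncio micro-batcher, assuming a sentence-transformers-style `model`. The window and batch limits are illustrative, and a production service would offload `model.encode` to an executor so the event loop stays responsive.

```python
# Sketch of server-side micro-batching: queries arriving within a short window
# are embedded together in one forward pass.
import asyncio

class MicroBatcher:
    def __init__(self, model, window_ms: float = 8.0, max_batch: int = 64):
        self.model = model
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.queue: asyncio.Queue = asyncio.Queue()

    async def embed(self, text: str):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self):
        while True:
            text, future = await self.queue.get()
            batch = [(text, future)]
            deadline = asyncio.get_running_loop().time() + self.window
            # Collect requests until the window closes or the batch fills up
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            texts = [t for t, _ in batch]
            embeddings = self.model.encode(texts)  # one GPU call for the whole batch
            for (_, fut), emb in zip(batch, embeddings):
                fut.set_result(emb)

# Usage inside the service:
#   batcher = MicroBatcher(model)
#   asyncio.create_task(batcher.run())
#   embedding = await batcher.embed("user query")
```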
Caching strategies
Why embedding caching matters
Embedding generation consumes computational resources for every request. Caching computed embeddings eliminates redundant calculations:¹³
Savings potential:
- Identical query: 100% compute savings
- Similar query (semantic cache): 80-95% savings
- Corpus embedding: One-time generation cost
Caching layers
In-memory LRU cache:¹⁴ Fastest access for frequently requested embeddings. Hash text content as cache keys—identical text yields cache hits.
```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=10000)
def get_embedding_cached(text_hash: str, text: str):
    # lru_cache keys on both arguments, so identical text
    # (and therefore an identical hash) yields a cache hit
    return embedding_model.encode(text)

def get_embedding(text: str):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    return get_embedding_cached(text_hash, text)
```
Distributed cache (Redis): Share cached embeddings across service instances. Redis provides sub-millisecond access with persistence.
```python
import hashlib
import numpy as np
import redis

redis_client = redis.Redis()

def get_embedding_with_cache(text: str):
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    # Return the cached vector if present
    cached = redis_client.get(cache_key)
    if cached:
        return np.frombuffer(cached, dtype=np.float32)

    # Otherwise compute, cache for one hour, and return
    embedding = model.encode(text)
    redis_client.setex(cache_key, 3600, embedding.tobytes())
    return embedding
```
Persistent storage: Store embeddings alongside source documents in databases. Avoid regeneration when documents haven't changed.
Semantic caching
Traditional caching requires exact matches. Semantic caching extends hit rates by matching similar queries:¹⁵
How semantic caching works:
1. Convert the query to an embedding
2. Search the cache for similar embeddings (cosine similarity > threshold)
3. If a match is found, return the cached result
4. Otherwise, compute a new embedding and cache it

Threshold tuning:
- Conservative (0.95+): Minimal false positives, lower hit rate
- Moderate (0.85-0.95): Balanced approach
- Aggressive (0.75-0.85): Higher hit rate, risk of inappropriate reuse

Implementation considerations:
- Query normalization (lowercasing, punctuation removal) increases hit rates
- Model updates invalidate the entire cache—plan for cache warming
- Monitor false positive rates and adjust thresholds
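A minimal sketch of the lookup-then-store flow described above, using normalized embeddings so a dot product equals cosine similarity. The in-memory lists stand in for a real vector index or Redis; `model` is assumed to be a sentence-transformers-style encoder.

```python
# Semantic cache sketch: reuse a cached result when a new query's embedding is
# within a cosine-similarity threshold of a previously seen query.
import numpy as np

class SemanticCache:
    def __init__(self, model, threshold: float = 0.9):
        self.model = model
        self.threshold = threshold
        self.query_vectors: list[np.ndarray] = []
        self.results: list[object] = []

    def lookup(self, query: str):
        vec = self.model.encode(query, normalize_embeddings=True)
        if self.query_vectors:
            sims = np.stack(self.query_vectors) @ vec   # cosine similarity (unit vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.results[best], vec           # cache hit
        return None, vec                                  # cache miss

    def store(self, query_vec: np.ndarray, result) -> None:
        self.query_vectors.append(query_vec)
        self.results.append(result)
```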
Vector storage and indexing
Storage requirements
Embedding storage scales with dimensionality and precision:
Storage = vectors × dimensions × bytes_per_value
Example (1B vectors, 1024 dims, float32):
= 1,000,000,000 × 1024 × 4 bytes
= 4.096 TB
Dimension reduction strategies:¹⁶
- Matryoshka embeddings: Models trained to support variable dimensions (OpenAI text-embedding-3)
- PCA/UMAP: Post-hoc dimension reduction with quality tradeoffs
- Binary quantization: 32x compression with ~5% quality loss
- Product quantization: 4-8x compression with minimal quality impact
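Binary quantization is simple enough to sketch directly: keep only the sign of each dimension, pack eight dimensions per byte, and compare codes with Hamming distance. The random vectors below are purely illustrative.

```python
# Binary quantization sketch: 32x reduction versus float32; Hamming distance on
# the packed codes approximates the original similarity ordering.
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    # vectors: (n, d) float32 -> (n, d/8) uint8
    return np.packbits(vectors > 0, axis=1)

def hamming_distance(codes: np.ndarray, query_code: np.ndarray) -> np.ndarray:
    # Popcount of XOR gives the number of differing bits per vector
    return np.unpackbits(codes ^ query_code, axis=1).sum(axis=1)

vectors = np.random.randn(1_000, 1024).astype(np.float32)
codes = binary_quantize(vectors)                 # 1024 dims -> 128 bytes per vector
query = binary_quantize(np.random.randn(1, 1024).astype(np.float32))
nearest = np.argsort(hamming_distance(codes, query[0]))[:10]
print(codes.nbytes, "bytes vs", vectors.nbytes)  # ~128 KB vs ~4 MB
```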
Vector database selection
Milvus:¹⁷ Open-source, scales to tens of billions of vectors. Distributed architecture supports horizontal scaling across nodes.
Pinecone: Managed service with automatic scaling. Eliminates operational overhead at premium pricing.
Qdrant: High-performance open-source option. Strong filtering capabilities for hybrid search.
pgvector: PostgreSQL extension for vector similarity. Excellent for teams already using PostgreSQL, limited at extreme scale.
Index optimization
HNSW (Hierarchical Navigable Small World):¹⁸ Dominant algorithm for approximate nearest neighbor search. Trades index size for query speed—appropriate for most production workloads.
DiskANN: Microsoft's disk-based indexing enables billion-scale search without loading entire index into memory. SQL Server 2025 natively supports DiskANN indexes.
GPU-accelerated search: CUDA and ROCm enable sub-second retrieval across billion-vector collections. NVIDIA cuVS provides optimized GPU vector search primitives.
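A minimal HNSW example, assuming the open-source hnswlib package; `M`, `ef_construction`, and `ef` are the knobs that trade index size and build time against recall, and the random vectors are illustrative.

```python
# Build and query an HNSW index with hnswlib.
import numpy as np
import hnswlib

dim, n = 1024, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(vectors, np.arange(n))

index.set_ef(64)  # higher ef -> better recall, slower queries
labels, distances = index.knn_query(vectors[:5], k=10)
```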
Production architecture patterns
Microservices decomposition
Separate embedding concerns into independent services:¹⁹
┌──────────────────────────────────────────────────────┐
│ API Gateway │
└──────────────────────────┬───────────────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│Document │ │ Query │ │ Batch │
│Embedding│ │ Embedding │ │ Pipeline │
│ Service │ │ Service │ │ Service │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Cache │ │ Cache │ │ Queue │
│ Layer │ │ Layer │ │ (Kafka) │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌──────▼──────┐
│ Vector DB │
│ (Milvus) │
└─────────────┘
Benefits:
- Scale query and batch services independently
- Different SLAs per service type
- Isolate failures between components
Multi-model architecture
Enterprise systems increasingly deploy multiple embedding models:²⁰
Domain specialization:
- PubMedBERT for medical documents
- CodeBERT for source code
- General-purpose model for everything else
Router pattern:
```python
def get_embedding(text: str, domain: str):
    if domain == "medical":
        return medical_model.encode(text)
    elif domain == "code":
        return code_model.encode(text)
    else:
        return general_model.encode(text)
```
Cross-model compatibility: Different models produce incompatible embedding spaces. Either store separate indexes per model or normalize to common space using learned projections.
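One common bridge is a linear projection fit on a sample of texts embedded by both models. The sketch below uses a least-squares fit; treat it as a rough approximation whose retrieval quality needs validation before use.

```python
# Learned linear projection between two embedding spaces, fit on paired samples.
import numpy as np

def fit_projection(source_vecs: np.ndarray, target_vecs: np.ndarray) -> np.ndarray:
    # Least-squares solution to source @ W ~= target
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

def project(vec: np.ndarray, W: np.ndarray) -> np.ndarray:
    mapped = vec @ W
    return mapped / np.linalg.norm(mapped)

# paired_texts would be embedded by both models offline, e.g.:
# W = fit_projection(model_a.encode(paired_texts), model_b.encode(paired_texts))
```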
RAGOps monitoring
Production embedding infrastructure requires comprehensive observability:²¹
Key metrics:
- Embedding generation latency (P50, P95, P99)
- Cache hit rates by query type
- Vector search latency and recall
- GPU utilization and memory pressure
- Queue depth for batch processing

Quality monitoring:
- Track retrieval precision over time
- Detect embedding drift when source data changes
- Alert on quality degradation thresholds
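A sketch of exposing a few of these metrics with the prometheus_client library; metric names, labels, and buckets are illustrative, and `model` is assumed to be a loaded embedding model.

```python
# Expose embedding-service metrics for Prometheus scraping.
from prometheus_client import Counter, Histogram, Gauge, start_http_server

EMBED_LATENCY = Histogram(
    "embedding_latency_seconds", "Query embedding latency",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)
CACHE_HITS = Counter("embedding_cache_hits_total", "Embedding cache hits", ["layer"])
QUEUE_DEPTH = Gauge("batch_queue_depth", "Documents waiting for batch embedding")

def embed_query(text: str):
    with EMBED_LATENCY.time():        # latency distribution; quantiles derived at query time
        return model.encode(text)

def record_cache_hit(layer: str) -> None:
    CACHE_HITS.labels(layer).inc()    # e.g., record_cache_hit("redis")

start_http_server(9100)               # scrape endpoint for Prometheus
```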
Organizations deploying embedding infrastructure at scale can leverage Introl's global expertise for GPU deployment and optimization across 257 locations worldwide.
Cost modeling
API-based economics
Monthly embedding cost (API):
= tokens_per_month × price_per_million / 1,000,000
Example (100M tokens/month, OpenAI small at $0.02/M):
= 100,000,000 × $0.02 / 1,000,000
= $2/month
Example (10B tokens/month, OpenAI large at $0.13/M):
= 10,000,000,000 × $0.13 / 1,000,000
= $1,300/month
Self-hosted economics
Monthly embedding cost (self-hosted):
= GPU_hours × hourly_rate
Example (A100 running 24/7):
= 720 hours × $2.50/hour
= $1,800/month (unlimited tokens)
Breakeven analysis: On raw per-token pricing, a dedicated GPU pays for itself only at billions of tokens per month against the cheaper APIs. In practice, teams often move to self-hosting once sustained volume passes roughly 50-100M tokens/month, because rate limits, data privacy, and fine-tuning requirements enter the calculation alongside GPU utilization rates.
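Because the crossover is so sensitive to the API price you compare against and to how busy the GPU stays, it is worth computing for your own workload. A small sketch with illustrative defaults:

```python
# Breakeven between API pricing and a dedicated GPU. Prices, hours, and
# utilization assumptions are illustrative -- plug in your own.
def breakeven_tokens_per_month(
    gpu_hourly_rate: float = 2.50,       # e.g., on-demand A100
    hours_per_month: float = 720,
    api_price_per_million: float = 0.13, # e.g., OpenAI text-embedding-3-large
) -> float:
    monthly_gpu_cost = gpu_hourly_rate * hours_per_month
    return monthly_gpu_cost / api_price_per_million * 1_000_000

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")
# ~13,846,153,846 tokens/month at these example rates -- the crossover moves
# sharply with API price, spot discounts, and how busy the GPU actually stays
```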
Total cost of ownership
Factor complete costs beyond compute:
- Engineering time: Setup, optimization, maintenance
- Storage: Vector database hosting and storage fees
- Redundancy: Multi-region deployment for availability
- Monitoring: Observability infrastructure
The embedding foundation
Embedding infrastructure determines whether AI applications scale efficiently or collapse under load. Organizations building production RAG systems, semantic search, or recommendation engines must treat embedding generation as critical infrastructure deserving careful architecture.
Start with API-based embeddings for rapid iteration. Measure actual token volumes and query patterns. When volumes justify infrastructure investment, migrate to self-hosted models with proper caching, batching, and distributed processing.
The technical choices—model selection, caching strategy, vector database, indexing algorithm—compound into order-of-magnitude differences in cost and performance. A well-architected embedding pipeline processes billions of vectors at cents per million tokens. A naive implementation bankrupts budgets while delivering sluggish performance.
Production embedding infrastructure requires the same engineering rigor applied to databases, APIs, and other foundational systems. The vectors powering modern AI deserve infrastructure built to last.
Key takeaways
For AI/ML engineers:
- Single L4 GPU: ~2,000 tokens/sec through a 7B model; 1B items = 5.8 days; 600B tokens = 9.5 years on one machine
- SkyPilot distributed: 406 L4 GPUs achieved 364,400 tokens/sec, reducing 20 hours to 2.3 hours (9x faster)
- Spot instances reduce embedding generation costs 61% (from $710 to $277 in one case study)

For model selection:
- OpenAI text-embedding-3: 3072/1536 dims, 8K context, $0.02-0.13/M tokens; proven reliability
- Voyage AI voyage-3: 1024 dims, 32K context, $0.06/M tokens; outperforms OpenAI by 9.74% average with 3x smaller dimensions
- Self-hosted (E5-Large, BGE, Mistral-embed): ~$0 per token after infrastructure; typically adopted beyond 50-100M tokens/month

For infrastructure teams:
- Batch optimization: sort by length to minimize padding (20-40% compute savings); optimal batch size varies from ~500 to 10,000+ by sequence length
- Caching layers: in-memory LRU for frequently requested embeddings, Redis distributed cache (sub-millisecond), persistent storage alongside documents
- Semantic caching: similarity thresholds of 0.85-0.95 balance hit rate and precision; 80-95% compute savings for similar queries

For storage architects:
- Storage calculation: 1B vectors × 1024 dims × 4 bytes = 4.1TB before indexing
- Dimension reduction: Matryoshka embeddings (OpenAI), binary quantization (32x compression, ~5% quality loss), product quantization (4-8x)
- Vector DB options: Milvus (tens of billions, open-source), Pinecone (managed), Qdrant (hybrid search), pgvector (PostgreSQL)

For cost optimization:
- API-based: 100M tokens/month × $0.02/M = $2/month (OpenAI small); 10B tokens/month × $0.13/M = $1,300/month (OpenAI large)
- Self-hosted: A100 24/7 = 720 hours × $2.50/hour = $1,800/month (unlimited tokens)
- Production RAG (10M docs, 100K daily queries): $50-100/day in embedding operations alone
References
1. SkyPilot Blog. "Large-Scale AI Batch Inference: 9x Faster Embedding Generation." 2025. https://blog.skypilot.co/large-scale-embedding/
2. Singh, Rajni. "Vector Embeddings at Scale: A Complete Guide to Cutting Storage Costs by 90%." Medium, October 2025. https://medium.com/@singhrajni/vector-embeddings-at-scale-a-complete-guide-to-cutting-storage-costs-by-90-a39cb631f856
3. Neum AI. "Retrieval Augmented Generation at Scale — Building a Distributed System for Synchronizing and Ingesting Billions of Text Embeddings." Medium. https://medium.com/@neum_ai/retrieval-augmented-generation-at-scale-building-a-distributed-system-for-synchronizing-and-eaa29162521
4. Elephas. "13 Best Embedding Models in 2025: OpenAI vs Voyage AI vs Ollama." 2025. https://elephas.app/blog/best-embedding-models
5. Voyage AI. "voyage-3-large: the new state-of-the-art general-purpose embedding model." January 2025. https://blog.voyageai.com/2025/01/07/voyage-3-large/
6. Baseten. "The best open-source embedding models." 2025. https://www.baseten.co/blog/the-best-open-source-embedding-models/
7. SkyPilot Blog. "Large-Scale AI Batch Inference: 9x Faster Embedding Generation."
8. Milvus. "How can you do batch processing of sentences for embedding to improve throughput when using Sentence Transformers?" 2025. https://milvus.io/ai-quick-reference/how-can-you-do-batch-processing-of-sentences-for-embedding-to-improve-throughput-when-using-sentence-transformers
9. SkyPilot Blog. "Large-Scale AI Batch Inference: 9x Faster Embedding Generation."
10. Milvus. "We Benchmarked 20+ Embedding APIs with Milvus: 7 Insights That Will Surprise You." 2025. https://milvus.io/blog/we-benchmarked-20-embedding-apis-with-milvus-7-insights-that-will-surprise-you.md
11. Microsoft Tech Community. "Building an Enterprise RAG Pipeline in Azure with NVIDIA AI Blueprint." 2025. https://techcommunity.microsoft.com/blog/azurearchitectureblog/building-an-enterprise-rag-pipeline-in-azure-with-nvidia-ai-blueprint-for-rag-an/4414301
12. Baseten. "High-performance embedding model inference." 2025. https://www.baseten.co/resources/guide/high-performance-embedding-model-inference/
13. Milvus. "How can caching of computed embeddings help improve application performance when using Sentence Transformers repeatedly on the same sentences?" 2025. https://milvus.io/ai-quick-reference/how-can-caching-of-computed-embeddings-help-improve-application-performance-when-using-sentence-transformers-repeatedly-on-the-same-sentences
14. Phan, Ngoc. "Understanding Caching Layers for Embedding Systems." Medium, October 2025. https://phann123.medium.com/understanding-caching-layers-for-embedding-systems-8ca01b4ae574
15. Redis. "What's the best embedding model for semantic caching?" 2025. https://redis.io/blog/whats-the-best-embedding-model-for-semantic-caching/
16. Singh, Rajni. "Vector Embeddings at Scale."
17. Milvus. "How do embeddings scale in production systems?" 2025. https://milvus.io/ai-quick-reference/how-do-embeddings-scale-in-production-systems
18. Microsoft DevBlogs. "SQL Server 2025 Vector and AI features." 2025. https://devblogs.microsoft.com/azure-sql/sql-server-2025-embraces-vectors-setting-the-foundation-for-empowering-your-data-with-ai/
19. arXiv. "RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines." 2025. https://arxiv.org/html/2506.03401v1
20. GreenNode. "5 Best Embedding Models for RAG: How to Choose the Right One." 2025. https://greennode.ai/blog/best-embedding-models-for-rag
21. arXiv. "RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines."
SEO Elements
Squarespace Excerpt (159 characters)
Embedding infrastructure at scale requires distributed GPU pipelines and strategic caching. Complete guide to vector generation, model selection, and production deployment.
SEO Title (55 characters)
Embedding Infrastructure: Vector Generation at Scale
SEO Description (154 characters)
Build production embedding infrastructure for RAG and search. Compare OpenAI, Voyage, self-hosted models. Learn batch processing, caching, and cost optimization.
Title Review
Current title "Embedding Infrastructure at Scale: Vector Generation for Production AI" works at 64 characters. Alternatives:
- "Embedding Infrastructure: Production Vector Generation Guide" (56 chars)
- "Vector Embedding Infrastructure: Scaling to Billions Guide" (54 chars)
URL Slug Recommendations
Primary: embedding-infrastructure-scale-vector-generation-production-guide-2025
Alternative 1: vector-embedding-scale-openai-voyage-self-hosted-comparison
Alternative 2: production-embedding-infrastructure-rag-pipeline-optimization
Alternative 3: embedding-generation-batch-processing-caching-guide-2025