Embedding Infrastructure at Scale: Vector Generation for Production AI
Updated December 11, 2025
December 2025 Update: Billion-item embedding collections requiring 5.8+ days on a single L4 GPU (2,000 tokens/sec). API embedding costs ranging from $0.02 to $0.18 per million tokens. 1B 1024-dim vectors requiring ~4TB of storage before indexing. Production RAG applications demanding millisecond similarity search across billions of vectors. Distributed GPU clusters and aggressive caching separating prototypes from production systems.
A single NVIDIA L4 GPU processes approximately 2,000 text tokens per second through a 7-billion parameter embedding model. At that rate, generating embeddings for a billion-item collection requires over 5.8 days on one machine.¹ The falcon-refinedweb dataset with 600 billion tokens would take more than 9.5 years. Embedding infrastructure at scale demands distributed systems, aggressive optimization, and strategic caching—capabilities that separate prototype RAG applications from production-ready knowledge systems.
Embeddings power modern AI applications: semantic search, retrieval-augmented generation, recommendation systems, and similarity matching. Yet organizations consistently underestimate the infrastructure required to generate, store, and serve embeddings at enterprise scale. What begins as a prototype with thousands of embeddings can balloon into a multi-million dollar infrastructure challenge as data grows to billions of vectors.²
The embedding infrastructure challenge
Scale dimensions
Embedding infrastructure must handle three distinct scaling challenges:
Generation throughput: Converting raw text, images, or other content into vector representations. Batch processing billions of documents requires distributed GPU clusters and optimized pipelines.
Storage capacity: High-dimensional vectors consume substantial space. A billion 1024-dimensional float32 vectors requires approximately 4 terabytes before indexing overhead.
Query latency: Production applications demand millisecond-level similarity search across billions of vectors, requiring specialized indexing and caching infrastructure.
Cost dynamics
Engineering teams routinely discover that embeddings quietly devour infrastructure budgets:³
Compute costs: Embedding generation requires GPU acceleration. API-based embedding costs $0.02-0.18 per million tokens depending on provider and model quality.
Storage costs: Vector databases charge per vector stored and indexed. Costs scale linearly with data volume—doubling vectors doubles storage expense.
Query costs: Similarity search across large collections requires computational resources that increase with collection size and query volume.
A production RAG system processing 10 million documents with 100,000 daily queries might cost $50-100 per day in embedding operations alone—$1,500-3,000 monthly before other infrastructure costs.
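The arithmetic behind such an estimate is simple enough to sketch. The function below is a back-of-envelope estimator, not a pricing tool; the churn rate, token counts, and per-million price are illustrative assumptions to replace with your own measurements.

```python
# Back-of-envelope daily embedding spend for an API-priced model.
# All inputs are illustrative assumptions -- substitute measured values.
def daily_embedding_cost(
    docs_per_day: int,
    tokens_per_doc: int,
    queries_per_day: int,
    tokens_per_query: int,
    price_per_million: float,
) -> float:
    total_tokens = docs_per_day * tokens_per_doc + queries_per_day * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million

# Example: re-embedding 5% of a 10M-document corpus daily plus 100K queries
print(daily_embedding_cost(
    docs_per_day=500_000,     # assumed daily document churn
    tokens_per_doc=800,       # assumed average document length
    queries_per_day=100_000,
    tokens_per_query=30,
    price_per_million=0.13,   # OpenAI text-embedding-3-large list price
))  # ~$52/day under these particular assumptions
```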
Embedding model selection
Provider comparison
OpenAI text-embedding-3:⁴
- Dimensions: 3072 (large), 1536 (small)
- Context window: 8,192 tokens
- Pricing: $0.13/M tokens (large), $0.02/M tokens (small)
- Strengths: Proven reliability, extensive documentation
- Considerations: Higher dimension increases storage costs

Voyage AI voyage-3:⁵
- Dimensions: 1024
- Context window: 32,000 tokens
- Pricing: $0.06/M tokens
- Strengths: Outperforms OpenAI by 9.74% average across domains; 3-4x smaller dimension reduces storage costs
- Considerations: Newer provider, smaller ecosystem

Cohere embed-v4:
- Dimensions: 1024
- Context window: 512 tokens (limited)
- Pricing: Competitive with OpenAI
- Strengths: Excellent multilingual support, low latency
- Considerations: Short context window limits document handling

Google Gemini embedding:
- Dimensions: 768
- Context window: 2,048 tokens
- Pricing: Free tier available
- Strengths: Cost-effective, good quality
- Considerations: Rate limits on free tier
Open-source alternatives
Self-hosted models eliminate per-token costs at the expense of infrastructure management:⁶
E5-Large-V2:
- Dimensions: 1024
- Performance: Strong MTEB/BEIR benchmark scores
- Best for: General-purpose retrieval
- Infrastructure: Runs efficiently on consumer GPUs

BGE-Large:
- Dimensions: 1024
- Performance: Competitive with commercial APIs
- Best for: Cost-sensitive deployments
- Infrastructure: Well-optimized inference

Mistral-embed:
- Dimensions: 1024
- Performance: 77.8% accuracy in benchmarks (highest tested)
- Best for: Maximum retrieval accuracy
- Infrastructure: Requires more GPU memory

GTE-Qwen2-7B:
- Dimensions: 4096
- Performance: State-of-the-art quality
- Best for: Quality-critical applications
- Infrastructure: Requires A100/H100-class GPUs
Selection criteria
| Factor | API Models | Self-Hosted |
|---|---|---|
| Setup complexity | Low | High |
| Per-token cost | $0.02-0.18/M | ~$0 (after infra) |
| Throughput control | Rate-limited | Unlimited |
| Data privacy | External processing | Full control |
| Model updates | Automatic | Manual |
| Fine-tuning | Limited | Full flexibility |
Choose API when: Volume under 100M tokens/month, team lacks ML infrastructure expertise, rapid deployment matters more than cost optimization.
Choose self-hosted when: Volume exceeds 100M tokens/month, data privacy requirements prevent external processing, need custom fine-tuning for domain-specific vocabulary.
Batch processing architecture
Distributed embedding pipelines
Large-scale embedding generation requires distributed processing across multiple GPUs:⁷
SkyPilot approach: By leveraging resources across cloud regions, organizations access hundreds of GPUs simultaneously. One documented deployment used 406 L4 GPUs to achieve 364,400 tokens per second throughput, reducing processing time from 20 hours to 2.3 hours (9x faster).
Pipeline architecture:
┌─────────────────┐
│ Data Source │
│ (S3/GCS/etc) │
└────────┬────────┘
│
┌────────▼────────┐
│ Coordinator │
│ (Job Scheduler)│
└────────┬────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Worker 1│ │ Worker 2 │ │ Worker N │
│ (GPU) │ │ (GPU) │ │ (GPU) │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌────────▼────────┐
│ Vector Store │
│ (Milvus/etc) │
└─────────────────┘
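The coordinator's job reduces to sharding the corpus and dispatching shards to workers. The minimal sketch below illustrates that pattern with a local process pool standing in for a real scheduler such as SkyPilot or Ray; `embed_shard` is a placeholder for the worker-side model loading, batching, and vector-store writes.

```python
# Minimal sketch of the coordinator's sharding logic. In production the
# "submit" step would hand shards to a real scheduler; here a process pool
# stands in for the GPU workers.
from concurrent.futures import ProcessPoolExecutor

def make_shards(doc_ids: list[str], num_workers: int) -> list[list[str]]:
    # Round-robin assignment keeps shard sizes within one document of each other
    return [doc_ids[i::num_workers] for i in range(num_workers)]

def embed_shard(shard: list[str]) -> int:
    # Placeholder: load the model on this worker's GPU, fetch documents,
    # embed in batches, write vectors to the vector store, return count
    return len(shard)

if __name__ == "__main__":
    doc_ids = [f"doc-{i}" for i in range(1_000_000)]
    shards = make_shards(doc_ids, num_workers=8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        completed = sum(pool.map(embed_shard, shards))
    print(f"embedded {completed} documents")
```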
Throughput optimization
Batch size tuning:⁸ Optimal batch size varies significantly with sequence length. For a given GPU, optimal batch size ranges from over 10,000 for short sequences to approximately 500 for long documents. Incorrect batch sizing leaves GPU utilization below 50%.
Sequence sorting: Pre-sorting sentences by length minimizes padding within batches. Tokenizers pad sequences to the longest item in each batch—grouping similar-length inputs reduces wasted computation by 20-40%.
Mixed precision inference: FP16 inference cuts memory usage and speeds processing on GPUs with tensor cores. Most embedding quality degrades negligibly with reduced precision.
```python
# Optimized batch embedding (sentence-transformers style model)
def embed_documents_optimized(texts, model, batch_size=64):
    # Sort by length to minimize padding within each batch
    sorted_texts = sorted(enumerate(texts), key=lambda x: len(x[1]))
    embeddings = [None] * len(texts)

    for i in range(0, len(sorted_texts), batch_size):
        batch = sorted_texts[i:i + batch_size]
        indices, batch_texts = zip(*batch)

        # Generate embeddings with GPU tensors
        batch_embeddings = model.encode(
            list(batch_texts),
            convert_to_tensor=True,   # Keep results on GPU
            normalize_embeddings=True,
        )

        # Restore original document order
        for idx, emb in zip(indices, batch_embeddings):
            embeddings[idx] = emb

    return embeddings
```
Cost optimization
Spot instances:⁹ Using spot/preemptible instances reduces embedding generation costs by 61% (from $710 to $277 in one case study). Batch workloads tolerate interruption—checkpoint progress and resume on new instances.
Regional arbitrage: Distribute workloads across cloud regions based on GPU availability and pricing. SkyPilot and similar tools automate cross-region scheduling for cost optimization.
Model selection tradeoffs: Smaller models process faster at lower cost. MiniLM delivers 5-14k sentences/second on CPU versus 1-2k for larger models—5x throughput difference. Benchmark quality requirements against processing costs.
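Checkpointing for spot interruptions can be as simple as a manifest of completed shards. The sketch below assumes each shard's vectors are written to durable storage as the shard finishes; the manifest path and the `embed_and_upload` helper are hypothetical, and in practice the manifest itself belongs in object storage rather than on local disk.

```python
# Shard-level checkpointing for spot/preemptible workers: record each completed
# shard in a manifest so a replacement instance can skip finished work.
import json
from pathlib import Path

MANIFEST = Path("checkpoints/completed_shards.json")  # illustrative path

def embed_and_upload(shard_id: str) -> None:
    # Placeholder: load the shard, embed it in batches, write vectors to the store
    pass

def load_completed() -> set[str]:
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()

def mark_completed(shard_id: str, completed: set[str]) -> None:
    completed.add(shard_id)
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(sorted(completed)))

def run_job(all_shards: list[str]) -> None:
    completed = load_completed()
    for shard_id in all_shards:
        if shard_id in completed:
            continue                   # embedded before the interruption
        embed_and_upload(shard_id)
        mark_completed(shard_id, completed)
```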
Real-time embedding infrastructure
Query embedding architecture
Production RAG systems generate embeddings for user queries in real-time. Latency directly impacts user experience:¹⁰
Target latencies:
- Query embedding: 10-50ms
- Vector search: 10-100ms
- Total retrieval: 50-200ms
Architecture patterns:
User Query → Load Balancer → Embedding Service → Vector DB → Results
│
┌───────┴───────┐
│ GPU Pool │
│ (N replicas) │
└───────────────┘
Embedding service deployment
Containerized serving: Deploy embedding models as containerized microservices. Kubernetes handles scaling, load balancing, and health checking.
NVIDIA NIM:¹¹ NVIDIA provides pre-optimized inference microservices for embedding models. NIM containers deliver production-ready performance without custom optimization.
vLLM for embeddings: While designed for LLM inference, vLLM supports embedding model serving with optimizations like continuous batching and PagedAttention.
Baseten Performance Client:¹² Custom Rust-based client delivers up to 12x better throughput for batch embedding workloads compared to standard OpenAI SDK implementations.
Latency optimization
Connection pooling: Maintain persistent connections to embedding services. Connection establishment adds 10-50ms overhead per request.
Request batching: Batch multiple queries arriving within short windows. Micro-batching (5-10ms windows) improves throughput while maintaining acceptable latency.
GPU memory management: Keep models loaded in GPU memory. Cold starts add seconds of latency for model loading.
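As an illustration of the request-batching pattern above, here is a minimal asyncio micro-batcher, assuming a sentence-transformers-style `model`. The window and batch limits are illustrative, and a production service would offload `model.encode` to an executor so the event loop stays responsive.

```python
# Sketch of server-side micro-batching: queries arriving within a short window
# are embedded together in one forward pass.
import asyncio

class MicroBatcher:
    def __init__(self, model, window_ms: float = 8.0, max_batch: int = 64):
        self.model = model
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.queue: asyncio.Queue = asyncio.Queue()

    async def embed(self, text: str):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def run(self):
        while True:
            text, future = await self.queue.get()
            batch = [(text, future)]
            deadline = asyncio.get_running_loop().time() + self.window
            # Collect requests until the window closes or the batch fills up
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            texts = [t for t, _ in batch]
            embeddings = self.model.encode(texts)  # one GPU call for the whole batch
            for (_, fut), emb in zip(batch, embeddings):
                fut.set_result(emb)

# Usage inside the service:
#   batcher = MicroBatcher(model)
#   asyncio.create_task(batcher.run())
#   embedding = await batcher.embed("user query")
```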
Caching strategies
Why embedding caching matters
Embedding generation consumes computational resources for every request. Caching computed embeddings eliminates redundant calculations:¹³
Savings potential:
- Identical query: 100% compute savings
- Similar query (semantic cache): 80-95% savings
- Corpus embedding: One-time generation cost
Caching layers
In-memory LRU cache:¹⁴ Fastest access for frequently requested embeddings. Hash text content as cache keys—identical text yields cache hits.
```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=10000)
def get_embedding_cached(text_hash: str, text: str):
    # lru_cache keys on both arguments, so identical text
    # (and therefore an identical hash) yields a cache hit
    return embedding_model.encode(text)

def get_embedding(text: str):
    text_hash = hashlib.md5(text.encode()).hexdigest()
    return get_embedding_cached(text_hash, text)
```
Distributed cache (Redis): Share cached embeddings across service instances. Redis provides sub-millisecond access with persistence.
```python
import hashlib
import numpy as np
import redis

redis_client = redis.Redis()

def get_embedding_with_cache(text: str):
    cache_key = f"emb:{hashlib.md5(text.encode()).hexdigest()}"

    # Return the cached vector if present
    cached = redis_client.get(cache_key)
    if cached:
        return np.frombuffer(cached, dtype=np.float32)

    # Otherwise compute, cache for one hour, and return
    embedding = model.encode(text)
    redis_client.setex(cache_key, 3600, embedding.tobytes())
    return embedding
```
Persistent storage: Store embeddings alongside source documents in databases. Avoid regeneration when documents haven't changed.
Semantic caching
Traditional caching requires exact matches. Semantic caching extends hit rates by matching similar queries:¹⁵
How semantic caching works:
1. Convert the query to an embedding
2. Search the cache for similar embeddings (cosine similarity > threshold)
3. If a match is found, return the cached result
4. Otherwise, compute a new embedding and cache it

Threshold tuning:
- Conservative (0.95+): Minimal false positives, lower hit rate
- Moderate (0.85-0.95): Balanced approach
- Aggressive (0.75-0.85): Higher hit rate, risk of inappropriate reuse

Implementation considerations:
- Query normalization (lowercasing, punctuation removal) increases hit rates
- Model updates invalidate the entire cache—plan for cache warming
- Monitor false positive rates and adjust thresholds
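A minimal sketch of the lookup-then-store flow described above, using normalized embeddings so a dot product equals cosine similarity. The in-memory lists stand in for a real vector index or Redis; `model` is assumed to be a sentence-transformers-style encoder.

```python
# Semantic cache sketch: reuse a cached result when a new query's embedding is
# within a cosine-similarity threshold of a previously seen query.
import numpy as np

class SemanticCache:
    def __init__(self, model, threshold: float = 0.9):
        self.model = model
        self.threshold = threshold
        self.query_vectors: list[np.ndarray] = []
        self.results: list[object] = []

    def lookup(self, query: str):
        vec = self.model.encode(query, normalize_embeddings=True)
        if self.query_vectors:
            sims = np.stack(self.query_vectors) @ vec   # cosine similarity (unit vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.results[best], vec           # cache hit
        return None, vec                                  # cache miss

    def store(self, query_vec: np.ndarray, result) -> None:
        self.query_vectors.append(query_vec)
        self.results.append(result)
```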
Vector storage and indexing
Storage requirements
Embedding storage scales with dimensionality and precision:
Storage = vectors × dimensions × bytes_per_value
Example (1B vectors, 1024 dims, float32):
= 1,000,000,000 × 1024 × 4 bytes
= 4.096 TB
Dimension reduction strategies:¹⁶
- Matryoshka embeddings: Models trained to support variable dimensions (OpenAI text-embedding-3)
- PCA/UMAP: Post-hoc dimension reduction with quality tradeoffs
- Binary quantization: 32x compression with ~5% quality loss
- Product quantization: 4-8x compression with minimal quality impact
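Binary quantization is simple enough to sketch directly: keep only the sign of each dimension, pack eight dimensions per byte, and compare codes with Hamming distance. The random vectors below are purely illustrative.

```python
# Binary quantization sketch: 32x reduction versus float32; Hamming distance on
# the packed codes approximates the original similarity ordering.
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    # vectors: (n, d) float32 -> (n, d/8) uint8
    return np.packbits(vectors > 0, axis=1)

def hamming_distance(codes: np.ndarray, query_code: np.ndarray) -> np.ndarray:
    # Popcount of XOR gives the number of differing bits per vector
    return np.unpackbits(codes ^ query_code, axis=1).sum(axis=1)

vectors = np.random.randn(1_000, 1024).astype(np.float32)
codes = binary_quantize(vectors)                 # 1024 dims -> 128 bytes per vector
query = binary_quantize(np.random.randn(1, 1024).astype(np.float32))
nearest = np.argsort(hamming_distance(codes, query[0]))[:10]
print(codes.nbytes, "bytes vs", vectors.nbytes)  # ~128 KB vs ~4 MB
```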
Vector database selection
Milvus:¹⁷ Open-source, scales to tens of billions of vectors. Distributed architecture supports horizontal scaling across nodes.
Pinecone: Managed service with automatic scaling. Eliminates operational overhead at premium pricing.
Qdrant: High-performance open-source option. Strong filtering capabilities for hybrid search.
pgvector: PostgreSQL extension for vector similarity. Excellent for teams already using PostgreSQL, limited at extreme scale.
Index optimization
HNSW (Hierarchical Navigable Small World):¹⁸ Dominant algorithm for approximate nearest neighbor search. Trades index size for query speed—appropriate for most production workloads.
DiskANN: Microsoft's disk-based indexing enables billion-scale search without loading entire index into memory. SQL Server 2025 natively supports DiskANN indexes.
GPU-accelerated search: CUDA and ROCm enable sub-second retrieval across billion-vector collections. NVIDIA cuVS provides optimized GPU vector search primitives.
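A minimal HNSW example, assuming the open-source hnswlib package; `M`, `ef_construction`, and `ef` are the knobs that trade index size and build time against recall, and the random vectors are illustrative.

```python
# Build and query an HNSW index with hnswlib.
import numpy as np
import hnswlib

dim, n = 1024, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(vectors, np.arange(n))

index.set_ef(64)  # higher ef -> better recall, slower queries
labels, distances = index.knn_query(vectors[:5], k=10)
```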
Production architecture patterns
Microservices decomposition
Separate embedding concerns into independent services:¹⁹
┌──────────────────────────────────────────────────────┐
│ API Gateway │
└──────────────────────────┬───────────────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│Document │ │ Query │ │ Batch │
│Embedding│ │ Embedding │ │ Pipeline │
│ Service │ │ Service │ │ Service │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
┌────▼────┐ ┌─────▼─────┐ ┌─────▼─────┐
│ Cache │ │ Cache │ │ Queue │
│ Layer │ │ Layer │ │ (Kafka) │
└────┬────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└─────────────────┼─────────────────┘
│
┌──────▼──────┐
│ Vector DB │
│ (Milvus) │
└─────────────┘
Benefits:
- Scale query and batch services independently
- Different SLAs per service type
- Isolate failures between components
Multi-model architecture
Enterprise systems increasingly deploy multiple embedding models:²⁰
Domain specialization:
- PubMedBERT for medical documents
- CodeBERT for source code
- General-purpose model for everything else
Router pattern:
```python
def get_embedding(text: str, domain: str):
    if domain == "medical":
        return medical_model.encode(text)
    elif domain == "code":
        return code_model.encode(text)
    else:
        return general_model.encode(text)
```
Cross-model compatibility: Different models produce incompatible embedding spaces. Either store separate indexes per model or normalize to common space using learned projections.
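One common bridge is a linear projection fit on a sample of texts embedded by both models. The sketch below uses a least-squares fit; treat it as a rough approximation whose retrieval quality needs validation before use.

```python
# Learned linear projection between two embedding spaces, fit on paired samples.
import numpy as np

def fit_projection(source_vecs: np.ndarray, target_vecs: np.ndarray) -> np.ndarray:
    # Least-squares solution to source @ W ~= target
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

def project(vec: np.ndarray, W: np.ndarray) -> np.ndarray:
    mapped = vec @ W
    return mapped / np.linalg.norm(mapped)

# paired_texts would be embedded by both models offline, e.g.:
# W = fit_projection(model_a.encode(paired_texts), model_b.encode(paired_texts))
```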
RAGOps monitoring
Production embedding infrastructure requires comprehensive observability:²¹
Key metrics:
- Embedding generation latency (P50, P95, P99)
- Cache hit rates by query type
- Vector search latency and recall
- GPU utilization and memory pressure
- Queue depth for batch processing

Quality monitoring:
- Track retrieval precision over time
- Detect embedding drift when source data changes
- Alert on quality degradation thresholds
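A sketch of exposing a few of these metrics with the prometheus_client library; metric names, labels, and buckets are illustrative, and `model` is assumed to be a loaded embedding model.

```python
# Expose embedding-service metrics for Prometheus scraping.
from prometheus_client import Counter, Histogram, Gauge, start_http_server

EMBED_LATENCY = Histogram(
    "embedding_latency_seconds", "Query embedding latency",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0),
)
CACHE_HITS = Counter("embedding_cache_hits_total", "Embedding cache hits", ["layer"])
QUEUE_DEPTH = Gauge("batch_queue_depth", "Documents waiting for batch embedding")

def embed_query(text: str):
    with EMBED_LATENCY.time():        # latency distribution; quantiles derived at query time
        return model.encode(text)

def record_cache_hit(layer: str) -> None:
    CACHE_HITS.labels(layer).inc()    # e.g., record_cache_hit("redis")

start_http_server(9100)               # scrape endpoint for Prometheus
```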
Organizations deploying embedding infrastructure at scale can leverage Introl's global expertise for GPU deployment and optimization across 257 locations worldwide.
Cost modeling
API-based economics
Monthly embedding cost (API):
= tokens_per_month × price_per_million / 1,000,000
Example (100M tokens/month, OpenAI small at $0.02/M):
= 100,000,000 × $0.02 / 1,000,000
= $2/month
Example (10B tokens/month, OpenAI large at $0.13/M):
= 10,000,000,000 × $0.13 / 1,000,000
= $1,300/month
Self-hosted economics
Monthly embedding cost (self-hosted):
= GPU_hours × hourly_rate
Example (A100 running 24/7):
= 720 hours × $2.50/hour
= $1,800/month (unlimited tokens)
Breakeven analysis: On raw per-token pricing, a dedicated GPU pays for itself only at billions of tokens per month against the cheaper APIs. In practice, teams often move to self-hosting once sustained volume passes roughly 50-100M tokens/month, because rate limits, data privacy, and fine-tuning requirements enter the calculation alongside GPU utilization rates.
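Because the crossover is so sensitive to the API price you compare against and to how busy the GPU stays, it is worth computing for your own workload. A small sketch with illustrative defaults:

```python
# Breakeven between API pricing and a dedicated GPU. Prices, hours, and
# utilization assumptions are illustrative -- plug in your own.
def breakeven_tokens_per_month(
    gpu_hourly_rate: float = 2.50,       # e.g., on-demand A100
    hours_per_month: float = 720,
    api_price_per_million: float = 0.13, # e.g., OpenAI text-embedding-3-large
) -> float:
    monthly_gpu_cost = gpu_hourly_rate * hours_per_month
    return monthly_gpu_cost / api_price_per_million * 1_000_000

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")
# ~13,846,153,846 tokens/month at these example rates -- the crossover moves
# sharply with API price, spot discounts, and how busy the GPU actually stays
```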
Total cost of ownership
Factor complete costs beyond compute:
- Engineering time: Setup, optimization, maintenance
- Storage: Vector database hosting and storage fees
- Redundancy: Multi-region deployment for availability
- Monitoring: Observability infrastructure
The embedding foundation
Embedding infrastructure determines whether AI applications scale efficiently or collapse under load. Organizations building production RAG systems, semantic search, or recommendation engines must treat embedding generation as critical infrastructure deserving careful architecture.
Start with API-based embeddings for rapid iteration. Measure actual token volumes and query patterns. When volumes justify infrastructure investment, migrate to self-hosted models with proper caching, batching, and distributed processing.
The technical choices—model selection, caching strategy, vector database, indexing algorithm—compound into order-of-magnitude differences in cost and performance. A well-architected embedding pipeline processes billions of vectors at cents per million tokens. A naive implementation bankrupts budgets while delivering sluggish performance.
Production embedding infrastructure requires the same engineering rigor applied to databases, APIs, and other foundational systems. The vectors powering modern AI deserve infrastructure built to last.
Key takeaways
For AI/ML engineers:
- Single L4 GPU: ~2,000 tokens/sec through a 7B model; 1B items = 5.8 days; 600B tokens = 9.5 years on one machine
- SkyPilot distributed: 406 L4 GPUs achieved 364,400 tokens/sec, reducing 20 hours to 2.3 hours (9x faster)
- Spot instances reduce embedding generation costs 61% (from $710 to $277 in one case study)

For model selection:
- OpenAI text-embedding-3: 3072/1536 dims, 8K context, $0.02-0.13/M tokens; proven reliability
- Voyage AI voyage-3: 1024 dims, 32K context, $0.06/M tokens; outperforms OpenAI by 9.74% average with 3x smaller dimensions
- Self-hosted (E5-Large, BGE, Mistral-embed): ~$0 per token after infrastructure; typically adopted beyond 50-100M tokens/month

For infrastructure teams:
- Batch optimization: sort by length to minimize padding (20-40% compute savings); optimal batch size varies from ~500 to 10,000+ by sequence length
- Caching layers: in-memory LRU for frequently requested embeddings, Redis distributed cache (sub-millisecond), persistent storage alongside documents
- Semantic caching: similarity thresholds of 0.85-0.95 balance hit rate and precision; 80-95% compute savings for similar queries

For storage architects:
- Storage calculation: 1B vectors × 1024 dims × 4 bytes = 4.1TB before indexing
- Dimension reduction: Matryoshka embeddings (OpenAI), binary quantization (32x compression, ~5% quality loss), product quantization (4-8x)
- Vector DB options: Milvus (tens of billions, open-source), Pinecone (managed), Qdrant (hybrid search), pgvector (PostgreSQL)

For cost optimization:
- API-based: 100M tokens/month × $0.02/M = $2/month (OpenAI small); 10B tokens/month × $0.13/M = $1,300/month (OpenAI large)
- Self-hosted: A100 24/7 = 720 hours × $2.50/hour = $1,800/month (unlimited tokens)
- Production RAG (10M docs, 100K daily queries): $50-100/day in embedding operations alone
References
1. SkyPilot Blog. "Large-Scale AI Batch Inference: 9x Faster Embedding Generation." 2025. https://blog.skypilot.co/large-scale-embedding/
2. Singh, Rajni. "Vector Embeddings at Scale: A Complete Guide to Cutting Storage Costs by 90%." Medium, October 2025. https://medium.com/@singhrajni/vector-embeddings-at-scale-a-complete-guide-to-cutting-storage-costs-by-90-a39cb631f856
3. Neum AI. "Retrieval Augmented Generation at Scale — Building a Distributed System for Synchronizing and Ingesting Billions of Text Embeddings." Medium. https://medium.com/@neum_ai/retrieval-augmented-generation-at-scale-building-a-distributed-system-for-synchronizing-and-eaa29162521
4. Elephas. "13 Best Embedding Models in 2025: OpenAI vs Voyage AI vs Ollama." 2025. https://elephas.app/blog/best-embedding-models
5. Voyage AI. "voyage-3-large: the new state-of-the-art general-purpose embedding model." January 2025. https://blog.voyageai.com/2025/01/07/voyage-3-large/
6. Baseten. "The best open-source embedding models." 2025. https://www.baseten.co/blog/the-best-open-source-embedding-models/
7. SkyPilot Blog. "Large-Scale AI Batch Inference: 9x Faster Embedding Generation."
8. Milvus. "How can you do batch processing of sentences for embedding to improve throughput when using Sentence Transformers?" 2025. https://milvus.io/ai-quick-reference/how-can-you-do-batch-processing-of-sentences-for-embedding-to-improve-throughput-when-using-sentence-transformers
9. SkyPilot Blog. "Large-Scale AI Batch Inference: 9x Faster Embedding Generation."
10. Milvus. "We Benchmarked 20+ Embedding APIs with Milvus: 7 Insights That Will Surprise You." 2025. https://milvus.io/blog/we-benchmarked-20-embedding-apis-with-milvus-7-insights-that-will-surprise-you.md
11. Microsoft Tech Community. "Building an Enterprise RAG Pipeline in Azure with NVIDIA AI Blueprint." 2025. https://techcommunity.microsoft.com/blog/azurearchitectureblog/building-an-enterprise-rag-pipeline-in-azure-with-nvidia-ai-blueprint-for-rag-an/4414301
12. Baseten. "High-performance embedding model inference." 2025. https://www.baseten.co/resources/guide/high-performance-embedding-model-inference/
13. Milvus. "How can caching of computed embeddings help improve application performance when using Sentence Transformers repeatedly on the same sentences?" 2025. https://milvus.io/ai-quick-reference/how-can-caching-of-computed-embeddings-help-improve-application-performance-when-using-sentence-transformers-repeatedly-on-the-same-sentences
14. Phan, Ngoc. "Understanding Caching Layers for Embedding Systems." Medium, October 2025. https://phann123.medium.com/understanding-caching-layers-for-embedding-systems-8ca01b4ae574
15. Redis. "What's the best embedding model for semantic caching?" 2025. https://redis.io/blog/whats-the-best-embedding-model-for-semantic-caching/
16. Singh, Rajni. "Vector Embeddings at Scale."
17. Milvus. "How do embeddings scale in production systems?" 2025. https://milvus.io/ai-quick-reference/how-do-embeddings-scale-in-production-systems
18. Microsoft DevBlogs. "SQL Server 2025 Vector and AI features." 2025. https://devblogs.microsoft.com/azure-sql/sql-server-2025-embraces-vectors-setting-the-foundation-for-empowering-your-data-with-ai/
19. arXiv. "RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines." 2025. https://arxiv.org/html/2506.03401v1
20. GreenNode. "5 Best Embedding Models for RAG: How to Choose the Right One." 2025. https://greennode.ai/blog/best-embedding-models-for-rag
21. arXiv. "RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines."
SEO Elements
Squarespace Excerpt (159 characters)
Embedding infrastructure at scale requires distributed GPU pipelines and strategic caching. Complete guide to vector generation, model selection, and production deployment.
SEO Title (55 characters)
Embedding Infrastructure: Vector Generation at Scale
SEO Description (154 characters)
Build production embedding infrastructure for RAG and search. Compare OpenAI, Voyage, self-hosted models. Learn batch processing, caching, and cost optimization.
Title Review
Current title "Embedding Infrastructure at Scale: Vector Generation for Production AI" works at 64 characters. Alternatives:
- "Embedding Infrastructure: Production Vector Generation Guide" (56 chars)
- "Vector Embedding Infrastructure: Scaling to Billions Guide" (54 chars)
URL Slug Recommendations
Primary: embedding-infrastructure-scale-vector-generation-production-guide-2025
Alternative 1: vector-embedding-scale-openai-voyage-self-hosted-comparison
Alternative 2: production-embedding-infrastructure-rag-pipeline-optimization
Alternative 3: embedding-generation-batch-processing-caching-guide-2025