AI Data Pipeline Architecture: Feeding Petabyte-Scale Training at 100GB/s
Updated December 11, 2025
December 2025 Update: Meta's Data PreProcessing Service (DPP) now eliminates data stalls across exabyte-scale training clusters. WEKApod achieves 720GB/s throughput from 8 storage nodes powering 768 H100 GPUs. PCIe Gen5 NVMe SSDs exceeding 14GB/s sequential reads are becoming standard for training-tier storage. Feature stores and tiered caching architectures reduce cold data access latency by 10x.
Meta discovered that 56% of GPU cycles sat stalled, waiting for training data.1 The company stores exabytes of training data in Tectonic, their distributed file system, but lacked the storage capacity to keep petabyte-scale datasets local to training hardware.2 The solution required building a Data PreProcessing Service (DPP) that scales to eliminate data stalls entirely. Organizations training large models face the same fundamental challenge: the most powerful GPUs accomplish nothing while waiting for input data.
The storage that feeds AI training determines whether GPU investments deliver expected returns. WEKApod achieves over 720GB/s throughput and 18 million IOPS with sub-150 microsecond latencies, powering 768 H100 GPUs from just 8 storage nodes.3 Meta's RSC supercomputer uses 46 petabytes of cache storage to keep GPUs fed.4 Training GPT-4 required approximately 25,000 A100 GPUs processing 13 trillion tokens over 90-100 days.5 At scale, data pipeline architecture becomes as critical as compute architecture.
The data pipeline challenge
Large language models require access to petabytes of high-quality, preprocessed data. Without fast, reliable storage, even the most powerful GPUs sit idle waiting for input.6 The performance tier of storage infrastructure enables the seamless flow of data through compute-intensive pipeline stages: normalization, tokenization, and training.
A typical machine learning pipeline involves data preprocessing managed by CPUs, model training offloaded to GPUs, and post-processing returned to CPUs.7 Bottlenecks occur when transferring data between CPU RAM and GPU DRAM. The mismatch between storage throughput, network bandwidth, preprocessing compute, and GPU consumption creates stalls that waste expensive accelerator capacity.
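A minimal PyTorch sketch of that CPU-to-GPU handoff: pinned host memory plus a dedicated copy stream lets the next batch cross the PCIe bus while the current batch computes. The batch shapes, the toy linear model, and the loop structure are illustrative assumptions, not a production data loader.

```python
import torch

# Sketch: stage the next batch onto the GPU while the current batch computes,
# using pinned host memory and a dedicated copy stream. Shapes are placeholders.
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()                  # side stream dedicated to transfers
model = torch.nn.Linear(4096, 4096).to(device)

def host_batch():
    # pin_memory() page-locks the host buffer so the copy can run as async DMA
    return torch.randn(512, 4096).pin_memory()

# Prime the pipeline with the first batch.
with torch.cuda.stream(copy_stream):
    staged = host_batch().to(device, non_blocking=True)

for step in range(100):
    torch.cuda.current_stream().wait_stream(copy_stream)  # ensure staged copy finished
    current = staged
    with torch.cuda.stream(copy_stream):
        # Start copying the *next* batch; it overlaps the compute issued below.
        staged = host_batch().to(device, non_blocking=True)
    out = model(current)                           # compute on the default stream
```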
Meta's data storage and ingestion architecture
Meta's end-to-end DSI pipeline consists of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales preprocessing independently from training compute.8 The architecture separates storage, preprocessing, and training into distinct scalable tiers.
Tectonic serves as Meta's exabyte-scale distributed file system, providing disaggregated storage infrastructure for AI training models.9 The company trains models on terabyte- to petabyte-scale datasets without local storage capacity matching those scales. Disaggregated storage enables flexible resource allocation but requires high-bandwidth networks connecting storage to compute.
The DPP Master receives session specifications containing dataset tables, partitions, required features, and transformation operations.10 The Master breaks preprocessing workloads across petabytes of data into independent, self-contained work items called splits. DPP Workers request splits from the Master and execute preprocessing transformations, decoupling preprocessing throughput from training node CPU capacity.
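Meta has not open-sourced DPP, but the master/split/worker pattern it describes can be sketched in a few lines. The following hypothetical Python analogue uses multiprocessing queues; the table names, partition scheme, and split sizing are invented for illustration.

```python
from dataclasses import dataclass
from multiprocessing import Process, Queue

@dataclass
class Split:
    table: str          # source table or file group
    partition: str      # date or shard partition
    row_range: tuple    # (start_row, end_row) handled by one worker

def master(work_queue: Queue, num_workers: int):
    # Break a large table into independent, self-contained work items.
    for day in ("2025-12-01", "2025-12-02"):
        for start in range(0, 1_000_000, 100_000):
            work_queue.put(Split("training_events", day, (start, start + 100_000)))
    for _ in range(num_workers):
        work_queue.put(None)                       # sentinel: no more splits

def worker(work_queue: Queue):
    while (split := work_queue.get()) is not None:
        # Placeholder for read -> clean -> tokenize -> write to the training tier.
        print(f"preprocessing {split.table}/{split.partition} rows {split.row_range}")

if __name__ == "__main__":
    q: Queue = Queue()
    workers = [Process(target=worker, args=(q,)) for _ in range(4)]
    for p in workers:
        p.start()
    master(q, num_workers=len(workers))
    for p in workers:
        p.join()
```

Scaling preprocessing then means launching more worker processes (or hosts), with no change to the training side.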
Storage hierarchy and caching
Meta builds toward tiered storage solutions combining HDDs and SSDs, with SSDs serving as caching tiers for high-reuse features.11 Not all training data requires the same access patterns: frequently accessed features benefit from flash storage while cold data remains on capacity-optimized media.
The caching strategy reduces storage costs without sacrificing training throughput. Hot data residing in fast tiers serves the majority of reads while cold data streams from capacity storage during initial epochs. Understanding data access patterns enables intelligent tiering decisions that balance cost against performance.
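A toy sketch of that hot/cold read path, assuming an LRU policy for the fast tier and an arbitrary capacity-tier reader; real systems add admission policies, prefetch, and persistence, but the control flow is the same.

```python
from collections import OrderedDict

class TieredFeatureReader:
    """Serve hot features from a fast tier, fall back to a slow capacity tier."""
    def __init__(self, hot_capacity_bytes: int, cold_read_fn):
        self.hot = OrderedDict()               # key -> bytes, maintained in LRU order
        self.hot_capacity = hot_capacity_bytes
        self.hot_bytes = 0
        self.cold_read = cold_read_fn          # e.g. an object-storage GET

    def read(self, key: str) -> bytes:
        if key in self.hot:                    # hot hit: serve from flash/RAM
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold_read(key)            # cold miss: stream from capacity tier
        self.hot[key] = value
        self.hot_bytes += len(value)
        while self.hot_bytes > self.hot_capacity:   # evict least-recently-used entries
            _, evicted = self.hot.popitem(last=False)
            self.hot_bytes -= len(evicted)
        return value

# Example: keep up to ~1 GiB of hot features in front of a slow reader (stubbed here).
reader = TieredFeatureReader(1 << 30, cold_read_fn=lambda k: b"\x00" * 4096)
sample = reader.read("user_embedding/12345")
```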
Storage technologies for AI training
Different storage technologies serve different roles in AI data pipelines. The choice depends on access patterns, capacity requirements, and budget constraints.
Parallel file systems
Parallel file systems like Lustre and GPFS deliver extreme performance with massive concurrency, making them ideal for synchronous I/O-intensive AI workloads.12 These systems stripe data across many storage servers, providing aggregate bandwidth that scales with server count.
Google Cloud offers Managed Lustre as a high-performance cache atop Cloud Storage, accelerating AI workloads requiring extremely high throughput and low-latency I/O operations.13 Organizations import and export data between Managed Lustre and Cloud Storage, using the parallel file system as a performance tier for active training while maintaining data in object storage for durability.
NVMe storage
PCIe Gen5 NVMe SSDs exceed 14 GB/s sequential read throughput and handle millions of random read IOPS.14 The technology eliminates storage as a bottleneck when training AI models on tens of terabytes of data. PCIe Gen5 adoption throughout 2024-2025 doubled per-lane throughput to approximately 4 GB/s, reaching roughly 64 GB/s in x16 configurations.
NVMe-oF (NVMe over Fabrics) extends NVMe performance across networks, enabling disaggregated storage architectures that maintain near-local latencies. Training clusters access shared NVMe storage pools without sacrificing the performance advantages of direct-attached drives.
Object storage for cold data
Object storage provides cost-effective capacity for petabyte-scale datasets that tolerate higher latencies. A large e-commerce company stores hundreds of petabytes of training data in AWS S3, with AI/ML training workloads distributed across multiple AWS regions and on-premises data centers.15
Object storage works best for batch ingestion patterns where training jobs load data into faster tiers before intensive processing begins. The economics favor object storage for archive and backup while performance tiers handle active training I/O.
Preprocessing at scale
Data preprocessing consumes significant compute resources and often becomes the bottleneck preventing full GPU utilization. Meta's experience showed CPUs on trainer nodes could not preprocess data fast enough to serve GPUs, motivating the distributed DPP architecture.16
Distributed preprocessing workers
The DPP architecture scales preprocessing workers independently from training nodes.17 Adding preprocessing capacity requires only adding worker instances, not modifying training infrastructure. The separation enables organizations to right-size preprocessing compute for specific datasets and transformation complexity.
Worker instances execute transformation operations including cleaning, normalization, tokenization, and feature extraction. Complex transformations require more preprocessing compute per training throughput unit. Simple transformations may keep pace with training using minimal preprocessing resources.
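As a concrete illustration of such a transform chain, the sketch below strings together cleaning, Unicode normalization, and a placeholder whitespace tokenizer; a real pipeline would substitute a trained BPE or WordPiece tokenizer.

```python
import re
import unicodedata

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip markup remnants
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def normalize(text: str) -> str:
    return unicodedata.normalize("NFKC", text).lower()

def tokenize(text: str) -> list[str]:
    return text.split(" ")                     # placeholder for BPE/WordPiece

def preprocess(record: str) -> list[str]:
    return tokenize(normalize(clean(record)))

print(preprocess("<p>Training   DATA pipelines</p>"))
# ['training', 'data', 'pipelines']
```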
Accelerated preprocessing
Industry efforts increasingly execute preprocessing transformation operations on accelerators rather than CPUs.18 NVIDIA DALI (Data Loading Library) offloads image decoding, augmentation, and format conversion to GPUs. Accelerated preprocessing eliminates CPU bottlenecks for image and video training pipelines.
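A minimal DALI sketch of that offload, assuming a directory of JPEG files laid out for DALI's file reader; the path, batch size, and normalization constants are illustrative placeholders to adapt.

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=256, num_threads=8, device_id=0)
def image_pipeline(file_root="/mnt/training/images"):
    jpegs, labels = fn.readers.file(file_root=file_root, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")          # JPEG decode on the GPU
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels

pipe = image_pipeline()
pipe.build()
images, labels = pipe.run()   # batches arrive decoded, resized, and normalized on the GPU
```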
Moving preprocessing to GPUs requires careful pipeline design to avoid creating new bottlenecks. GPU memory used for preprocessing reduces memory available for model parameters and activations. The tradeoff between preprocessing acceleration and training capacity depends on workload characteristics.
Feature stores
Google recommends using Vertex AI Feature Store for features ready for online serving.19 Feature stores precompute and cache feature values, eliminating repeated computation across training runs. Scheduling feature engineering jobs to compute new feature values at the required cadence keeps data fresh without real-time preprocessing overhead.
Feature stores prove particularly valuable for recommendation models where feature computation complexity exceeds per-request time budgets. Training and inference can both access the same precomputed features, maintaining consistency between development and production.
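The pattern itself is vendor-neutral: a scheduled job writes precomputed feature rows keyed by entity ID, and training and serving share the same lookup. The sketch below uses SQLite purely as a stand-in for a managed feature store such as Vertex AI Feature Store; the table and column names are invented.

```python
import sqlite3
import time

store = sqlite3.connect("features.db")
store.execute(
    "CREATE TABLE IF NOT EXISTS user_features "
    "(user_id TEXT PRIMARY KEY, clicks_7d INTEGER, updated_at REAL)"
)

def feature_engineering_job(raw_counts: dict[str, int]) -> None:
    # Runs on a schedule (e.g. hourly) so serving never computes features inline.
    now = time.time()
    store.executemany(
        "INSERT OR REPLACE INTO user_features VALUES (?, ?, ?)",
        [(uid, clicks, now) for uid, clicks in raw_counts.items()],
    )
    store.commit()

def get_features(user_id: str):
    # Same lookup path for training-set generation and online serving.
    return store.execute(
        "SELECT clicks_7d FROM user_features WHERE user_id = ?", (user_id,)
    ).fetchone()

feature_engineering_job({"user_42": 17})
print(get_features("user_42"))   # (17,)
```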
Network architecture for data pipelines
High-bandwidth interconnects provide the foundation for disaggregated storage architectures. InfiniBand and RoCE (RDMA over Converged Ethernet) deliver ultra-low latency and high throughput essential for distributed training across GPU clusters and rapid dataset access.20
Storage network design
Storage networks must match aggregate read throughput to GPU training consumption. A cluster of 1,000 H100 GPUs training a data-hungry workload may require tens of gigabytes per second of sustained storage throughput. Network capacity between storage and compute tiers must exceed this requirement with headroom for burst patterns.
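A back-of-the-envelope sizing sketch in Python; the per-GPU sample rate and sample size are assumptions to replace with measured numbers from your own workload, not H100 specifications.

```python
# Rough storage-throughput sizing: GPUs x samples/sec x bytes/sample, plus burst headroom.
gpus = 1000
samples_per_sec_per_gpu = 1000          # assumption: measure on your own workload
bytes_per_sample = 50 * 1024            # assumption: ~50 KiB per preprocessed sample

sustained_gbps = gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9
burst_headroom = 1.5                    # provision ~50% above sustained demand

print(f"sustained ~{sustained_gbps:.0f} GB/s, provision ~{sustained_gbps * burst_headroom:.0f} GB/s")
# sustained ~51 GB/s, provision ~77 GB/s
```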
Network topology affects achievable throughput. Fat-tree topologies provide full bisection bandwidth but cost more than oversubscribed designs. Training workloads with heavy storage I/O benefit from non-blocking fabrics that eliminate network congestion as a bottleneck.
Data transfer optimization
Data transfer techniques including parallel I/O, prefetching, caching, compression, and data locality optimization ensure efficient data movement between storage systems and compute nodes.21 Prefetching anticipates data requirements and stages data before compute nodes request it. Compression reduces network bandwidth requirements at the cost of compute cycles.
Batching data reduces transaction frequency, amortizing per-request overhead across larger transfers.22 Filtering data minimizes sample size before sending to GPUs, reducing both storage reads and network transfers. The combination of techniques can reduce effective storage bandwidth requirements significantly.
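On the training-client side, several of these techniques map directly onto standard PyTorch DataLoader options. The sketch below shows parallel reads, prefetching, batching, and pinned memory; the synthetic dataset stands in for a real sharded reader.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticShards(Dataset):
    """Stand-in for a sharded storage reader; __getitem__ would read and decode a sample."""
    def __len__(self):
        return 100_000
    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), idx % 1000

loader = DataLoader(
    SyntheticShards(),
    batch_size=256,          # amortize per-request overhead across larger transfers
    num_workers=8,           # parallel I/O + preprocessing processes
    prefetch_factor=4,       # each worker keeps 4 batches staged ahead of the GPU
    pin_memory=True,         # page-locked buffers for fast host-to-device copies
    persistent_workers=True,
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)   # copy overlaps the next prefetch
    break
```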
Building data pipelines at scale
Organizations deploying petabyte-scale training infrastructure need integrated approaches to storage, preprocessing, and networking that match GPU compute capacity.
Capacity planning
Storage capacity planning must account for training data growth alongside model scaling. Training datasets grow as organizations accumulate more data and pursue larger models requiring more tokens. Capacity requirements compound as organizations retain multiple dataset versions for reproducibility.
Throughput planning proves more challenging than capacity planning. The relationship between model size, batch size, and data throughput requirements varies by architecture and training configuration. Benchmarking specific workloads on target infrastructure provides the most reliable throughput requirements.
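A crude single-process microbenchmark sketch for the training tier's sequential-read path; the file path and block size are assumptions, and results include page-cache effects unless the file exceeds host memory. Purpose-built tools such as fio give more rigorous numbers, and aggregate throughput requires running many readers in parallel.

```python
import time

PATH = "/mnt/training/shard-000.bin"    # assumption: a large existing file on the tier under test
BLOCK = 8 * 1024 * 1024                 # 8 MiB reads approximate large sequential I/O

def read_throughput_gbps(path: str) -> float:
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:       # unbuffered binary reads
        while chunk := f.read(BLOCK):
            total += len(chunk)
    return total / (time.perf_counter() - start) / 1e9

if __name__ == "__main__":
    print(f"{read_throughput_gbps(PATH):.2f} GB/s from {PATH}")
```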
Infrastructure deployment expertise
Data pipeline infrastructure complexity matches or exceeds compute infrastructure complexity. Storage systems, high-speed networks, and preprocessing services must integrate seamlessly with GPU clusters. Configuration errors in any component create bottlenecks that waste GPU investment.
Introl's network of 550 field engineers specializes in the integrated infrastructure deployments that large-scale AI training requires.23 The company ranked #14 on the 2025 Inc. 5000 with 9,594% three-year growth, reflecting demand for professional infrastructure services.24 Organizations building training clusters benefit from deployment expertise that addresses storage, networking, and compute as an integrated system.
Managing deployments reaching 100,000 GPUs with over 40,000 miles of fiber optic network infrastructure requires operational scale that matches the largest training initiatives.25 Introl operates across 257 locations spanning NAMER, EMEA, APAC, and LATAM, positioning expertise where organizations build training infrastructure.26
Data pipeline performance depends on physical infrastructure quality. Storage systems require proper power, cooling, and rack placement. Network cabling must meet specifications for the selected protocols. Professional deployment ensures the physical layer supports the performance targets storage and network equipment promise.
The data pipeline imperative
Data pipelines will grow in importance as models scale and GPU clusters expand. The 56% stall rate Meta experienced before building DPP would waste billions of dollars across the industry's GPU investments. Organizations cannot afford similar inefficiencies as training runs consume increasing resources.
Future GPU generations will demand even higher storage throughput. NVIDIA's Blackwell GB300 racks will reach 163 kW per rack in 2025, Vera Rubin systems may require 300+ kW in 2026, and Rubin Ultra racks could exceed 600 kW by 2027.27 The compute density increases imply corresponding storage throughput requirements that only purpose-built data pipelines can satisfy.
Organizations beginning AI infrastructure investments should architect data pipelines with the same attention devoted to GPU clusters. The compute-to-storage ratio must maintain balance as both scale. Retrofitting storage performance into existing deployments proves more expensive and disruptive than designing adequate data pipelines from the start.
The combination of distributed preprocessing, tiered storage, high-speed networking, and professional deployment creates data pipelines that keep GPUs fully utilized. Organizations that master data pipeline architecture extract full value from their GPU investments, while those with inadequate pipelines watch expensive accelerators sit idle waiting for data.
Key takeaways
For data architects:
- Meta found 56% of GPU cycles stalled waiting for data; WEKApod achieves 720GB/s throughput with sub-150μs latency from 8 storage nodes feeding 768 H100s
- Tiered storage architecture critical: NVMe for hot data, SSD caching for high-reuse features, HDD for capacity, object storage for cold petabyte-scale datasets
- Meta's DPP Master architecture decouples preprocessing from training: breaks workloads into self-contained splits across petabytes of data

For storage engineers:
- PCIe Gen5 NVMe SSDs exceed 14GB/s sequential read and millions of random IOPS; NVMe-oF extends performance across networks
- Parallel file systems (Lustre, GPFS) deliver extreme performance with massive concurrency for synchronous I/O workloads
- Storage capacity planning must account for multiple dataset versions for reproducibility; throughput requirements vary by model architecture

For infrastructure teams:
- Network topology determines achievable throughput: fat-tree provides full bisection bandwidth; InfiniBand/RoCE deliver ultra-low latency for disaggregated storage
- Data transfer optimization: prefetching anticipates requirements, compression trades compute for bandwidth, batching amortizes per-request overhead
- Storage-to-GPU bandwidth must exceed consumption with headroom for bursts; a 1,000 H100 cluster may require tens of GB/s sustained throughput

For capacity planners:
- GPU hardware represents 50-60% of total AI infrastructure cost; storage, networking, and power multiply the investment
- Future GPU generations demand even higher throughput: Blackwell GB300 at 163kW, Vera Rubin at 300kW+, Rubin Ultra at 600kW+
- Retrofitting storage performance proves more expensive than designing adequate pipelines from inception

For operations teams:
- NVIDIA DALI offloads image preprocessing to GPUs; accelerated preprocessing eliminates CPU bottlenecks but requires careful memory tradeoffs
- Feature stores (Vertex AI) precompute and cache feature values, eliminating repeated computation across training runs
- Meta's RSC uses 46PB cache storage; storage systems require proper power, cooling, and rack placement to achieve performance targets
References
1. Meta. "Optimizing data ingestion infrastructure to support more AI models." Engineering at Meta. September 19, 2022. https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/
2. Meta. "Optimizing data ingestion infrastructure." September 2022.
3. WEKA. "Unveiling the AI Future: Highlights from NVIDIA's GTC 2024." WEKA Blog. 2024. https://www.weka.io/blog/ai-ml/unveiling-the-ai-future-highlights-from-nvidias-gtc-2024/
4. WEKA. "Unveiling the AI Future." 2024.
5. Epoch AI. "Can AI scaling continue through 2030?" Epoch AI Blog. 2024. https://epoch.ai/blog/can-ai-scaling-continue-through-2030
6. Omdia. "The Storage that feeds AI training and modeling for High-Impact AI." September 2025. https://omdia.tech.informa.com/blogs/2025/sep/the-storage-that-feeds-ai-training-and-modeling-for-high-impact-ai
7. MobiDev. "GPU for Machine Learning & AI in 2025: On-Premises vs Cloud." 2025. https://mobidev.biz/blog/gpu-machine-learning-on-premises-vs-cloud
8. arXiv. "Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training." 2021. https://arxiv.org/abs/2108.09373v4
9. Meta. "Optimizing data ingestion infrastructure." September 2022.
10. arXiv. "Understanding Data Storage and Ingestion." 2021.
11. Meta. "Optimizing data ingestion infrastructure." September 2022.
12. Omdia. "The Storage that feeds AI training." September 2025.
13. Google Cloud. "Design storage for AI and ML workloads in Google Cloud." Cloud Architecture Center. 2025. https://docs.google.com/architecture/ai-ml/storage-for-ai-ml
14. ServersSimply. "Next-Gen HPC & AI Infrastructure 2025: GPUs, CXL, Gen5 NVMe." 2025. https://www.serversimply.com/blog/next-gen-hpc-and-ai-infrastructure-in-2025
15. Alluxio. "Optimizing I/O for AI Workloads in Geo-Distributed GPU Clusters." Whitepaper. 2024. https://www.alluxio.io/whitepaper/optimizing-i-o-for-ai-workloads-in-geo-distributed-gpu-clusters
16. Meta. "Optimizing data ingestion infrastructure." September 2022.
17. arXiv. "Understanding Data Storage and Ingestion." 2021.
18. Meta. "Optimizing data ingestion infrastructure." September 2022.
19. Google Cloud. "Best practices for implementing machine learning on Google Cloud." Cloud Architecture Center. 2025. https://docs.google.com/architecture/ml-on-gcp-best-practices
20. LakeFS. "AI Data Infrastructure: Components, Challenges & Best Practices." 2024. https://lakefs.io/blog/ai-data-infrastructure/
21. LakeFS. "AI Data Infrastructure." 2024.
22. MobiDev. "GPU for Machine Learning & AI in 2025." 2025.
23. Introl. "Company Overview." Introl. 2025. https://introl.com
24. Inc. "Inc. 5000 2025." Inc. Magazine. 2025.
25. Introl. "Company Overview." 2025.
26. Introl. "Coverage Area." Introl. 2025. https://introl.com/coverage-area
27. ServersSimply. "Next-Gen HPC & AI Infrastructure 2025." 2025.
28. NVIDIA. "Accelerating AI Data Pipelines: A Case Study Featuring 25 Petabytes of Tiered Storage." GTC 2024. https://www.nvidia.com/en-us/on-demand/session/gtc24-s63072/
29. arXiv. "Scaling Intelligence: Designing Data Centers for Next-Gen Language Models." 2025. https://arxiv.org/html/2506.15006v3
30. Servermall. "Infrastructure for machine learning and generative AI." Medium. 2024. https://medium.com/@servermall/infrastructure-for-machine-learning-and-generative-ai-evolution-demands-and-future-directions-37facf907efd
31. Edgecore. "AI Data Center Infrastructure: Powering the Future of AI Compute." 2024. https://edgecore.com/ai-data-center-infastructure/
32. Meta. "How Meta trains large language models at scale." Engineering at Meta. June 12, 2024. https://engineering.fb.com/2024/06/12/data-infrastructure/training-large-language-models-at-scale-meta/
33. ACM. "Understanding data storage and ingestion for large-scale deep recommendation model training." ISCA '22. 2022. https://dl.acm.org/doi/10.1145/3470496.3533044