
NVLink and scale-up networking: when 800G Ethernet isn't enough

Updated December 11, 2025

December 2025 Update: NVLink 5 delivering 1.8TB/s per GPU (18 links × 100GB/s)—14x PCIe Gen5 bandwidth. GB200 NVL72 connecting 72 GPUs with 130TB/s aggregate. NVSwitch enabling 576 GPUs in non-blocking fabric with 1PB/s total bandwidth. Fifth-gen NVLink achieving 12x bandwidth of first-gen (2014). Scale-up networking creates capabilities scale-out networks cannot match.

A single NVIDIA Blackwell GPU supports up to 18 NVLink connections at 100 gigabytes per second each, delivering 1.8 terabytes per second of total bandwidth—14 times the bandwidth of PCIe Gen5.¹ The GB200 NVL72 system connects 72 GPUs in a single NVLink domain with 130 terabytes per second of aggregate bandwidth.² NVIDIA's NVLink Switch enables 576 GPUs in a non-blocking compute fabric with over 1 petabyte per second of total bandwidth.³ Scale-up networking creates infrastructure capabilities that scale-out Ethernet and InfiniBand networks cannot match.

The distinction between scale-up and scale-out networking defines modern AI infrastructure architecture. NVLink and NVSwitch handle communication within nodes and racks, delivering the bandwidth and latency required for tensor parallelism across large models. InfiniBand and Ethernet handle communication between racks, providing the reach required for data parallelism across thousands of GPUs. Understanding when each technology applies determines whether infrastructure investments deliver expected performance.

Fifth-generation NVLink doubles bandwidth compared to the previous generation.⁴ Each link operates at 100 gigabytes per second bidirectional, with 18 links per Blackwell GPU providing 1.8 terabytes per second aggregate bandwidth.⁵ The improvement exceeds PCIe Gen5 bandwidth by more than 14 times.⁶

The evolution across generations demonstrates the trajectory:

| NVLink generation | GPU architecture | Links per GPU | Per-GPU bandwidth |
|---|---|---|---|
| 1st (2016) | Pascal P100 | 4 | 160 GB/s |
| 2nd (2017) | Volta V100 | 6 | 300 GB/s |
| 3rd (2020) | Ampere A100 | 12 | 600 GB/s |
| 4th (2022) | Hopper H100 | 18 | 900 GB/s |
| 5th (2024) | Blackwell B200 | 18 | 1.8 TB/s |

Fifth-generation NVLink delivers roughly 12 times the bandwidth of first-generation NVLink, which NVIDIA announced in 2014 and first shipped with the Pascal architecture in 2016.⁷ The doubling from Hopper to Blackwell reflects the increasing bandwidth requirements of trillion-parameter models.

The per-link bandwidth increase from 50 gigabytes per second in NVLink 4 to 100 gigabytes per second in NVLink 5 enables the same 18-link configuration to double total throughput.⁸ The architecture maintains link count while improving signaling rates.
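The arithmetic behind these headline numbers is straightforward. The short Python sketch below reproduces the 1.8 TB/s figure and the roughly 14x comparison; the PCIe Gen5 x16 value of about 128 GB/s bidirectional is an assumption based on the commonly cited ~64 GB/s per direction, not a figure from the sources above.

```python
# Back-of-envelope check of the NVLink 5 per-GPU bandwidth figures.
# Assumption: PCIe Gen5 x16 at roughly 64 GB/s per direction (~128 GB/s bidirectional).

NVLINK5_LINKS_PER_GPU = 18
NVLINK5_GBPS_PER_LINK = 100          # GB/s per link, bidirectional
PCIE_GEN5_X16_GBPS = 128             # GB/s bidirectional (assumed reference point)

nvlink_per_gpu = NVLINK5_LINKS_PER_GPU * NVLINK5_GBPS_PER_LINK   # 1,800 GB/s
print(f"NVLink 5 per GPU: {nvlink_per_gpu / 1000:.1f} TB/s")     # 1.8 TB/s
print(f"vs PCIe Gen5 x16: {nvlink_per_gpu / PCIE_GEN5_X16_GBPS:.1f}x")  # ~14x
```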

NVSwitch architecture evolution

NVIDIA introduced NVSwitch with the DGX-2 system in 2018 to enable full interconnectivity among GPUs within a single system.⁹ NVSwitch acts as a high-speed, non-blocking crossbar switch for NVLink traffic, allowing every GPU in a system to communicate with every other GPU at full speed.¹⁰

The DGX-2 featured 16 V100 GPUs connected through first-generation NVSwitch.¹¹ Each NVSwitch chip provided 18 NVLink ports with 900 gigabytes per second aggregate switching capacity.¹² The chip, fabricated on TSMC's 12nm process, contained 2 billion transistors and drew roughly 100 watts.¹³

Second-generation NVSwitch arrived with DGX A100 in 2020, supporting NVLink 3.0 at 600 gigabytes per second per GPU.¹⁴ Six NVSwitch chips created a fully connected network topology for eight A100 GPUs.¹⁵

Third-generation NVSwitch for Hopper increased to 25.6 terabits per second total bidirectional bandwidth per chip.¹⁶ Four NVSwitch chips in each HGX H100 and HGX H200 system provide 3.6 terabytes per second bidirectional network bandwidth across eight GPUs.¹⁷ Third-generation NVSwitch introduced SHARP functionality for in-network computation, aggregating and updating results across multiple GPU units without requiring round trips to individual GPUs.¹⁸

Fourth-generation NVSwitch for Blackwell features 72 NVLink 5.0 ports per chip.¹⁹ The NVLink 5 Switch provides 144 NVLink ports with 14.4 terabytes per second non-blocking switching capacity.²⁰ This generation introduced rack-level switching, moving NVSwitch from servers to dedicated switch trays.

GB200 NVL72 scale-up architecture

The GB200 NVL72 connects 36 Grace CPUs and 72 Blackwell GPUs in a rack-scale, liquid-cooled design.²¹ The 72-GPU NVLink domain acts as a single, massive GPU and delivers 30 times faster real-time trillion-parameter large language model inference compared to previous generations.²²

The physical architecture distributes components across compute trays and switch trays.²³ Each compute tray contains two GB200 Superchips, with each Superchip comprising two B200 GPUs and one Grace CPU.²⁴ The system contains 18 compute trays totaling 72 GPUs.

Nine NVLink switch trays provide full-mesh connectivity.²⁵ Each switch tray contains two NVLink Switch chips with 144 NVLink ports total.²⁶ The nine switches fully connect each of the 18 NVLink ports on every Blackwell GPU.²⁷

No direct GPU-to-GPU connectivity exists within a single server or compute tray.²⁸ All communication routes through the external NVSwitch fabric.²⁹ This architecture makes all 72 GPUs equivalent from a connectivity perspective—any GPU can communicate with any other at the same bandwidth and latency.³⁰

The 130 terabytes per second aggregate NVLink bandwidth enables tensor parallelism across all 72 GPUs.³¹ Large models that exceed single-GPU memory capacity can distribute tensors across the entire domain with minimal communication overhead. The architecture eliminates the traditional boundary between server and rack networking for scale-up workloads.
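As a sanity check on the topology described above, the published figures fit together cleanly: 72 GPUs with 18 links each need 1,296 switch ports, which is exactly what nine 144-port switch trays provide. The sketch below walks through that accounting; it is an illustration of the quoted numbers, not a description of the physical cabling.

```python
# Port and bandwidth accounting for the GB200 NVL72 figures quoted above.
GPUS = 72
LINKS_PER_GPU = 18                    # NVLink 5 links per Blackwell GPU
GBPS_PER_LINK = 100                   # GB/s per link

SWITCH_TRAYS = 9
PORTS_PER_TRAY = 144                  # two NVSwitch chips x 72 ports each

gpu_side_links = GPUS * LINKS_PER_GPU               # 1,296 GPU-side links
switch_side_ports = SWITCH_TRAYS * PORTS_PER_TRAY   # 1,296 switch-side ports
assert gpu_side_links == switch_side_ports          # every GPU link lands on a switch port

per_tray_tbps = PORTS_PER_TRAY * GBPS_PER_LINK / 1000    # 14.4 TB/s per switch tray
aggregate_tbps = gpu_side_links * GBPS_PER_LINK / 1000   # 129.6 TB/s, quoted as 130 TB/s
print(gpu_side_links, per_tray_tbps, aggregate_tbps)
```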

Scale-up versus scale-out networking

Scale-up networking (NVLink) and scale-out networking (InfiniBand and Ethernet) serve fundamentally different purposes in AI infrastructure.³²

NVLink excels at communication between GPUs within a single domain, providing bandwidth on the order of terabytes per second, far beyond what InfiniBand delivers between nodes.³³ The low latency and high bandwidth support tensor parallelism, where model weights distribute across GPUs and must synchronize at every layer. NVLink's 1.8 terabytes per second per GPU enables this synchronization without becoming a bottleneck.

NVLink does not help once communication crosses node boundaries.³⁴ Inter-node networking requires InfiniBand or Ethernet regardless of intra-node NVLink capability. The technologies operate at different layers of the hierarchy.

InfiniBand provides the industry standard for connecting thousands of server nodes.³⁵ Remote Direct Memory Access (RDMA) allows servers to exchange data directly between memory spaces, bypassing CPU and OS overhead.³⁶ The feature proves essential for large-scale distributed training using data parallelism, where each node processes different batches and synchronizes gradients.
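To make the gradient-synchronization step concrete, the sketch below shows a minimal data-parallel all-reduce with PyTorch and NCCL, which rides on RDMA-capable transports when they are available. It assumes a torchrun-style launch and a toy model; it is a sketch of the communication pattern, not a recipe for production training.

```python
# Minimal sketch of data-parallel gradient synchronization with NCCL.
# Assumes a torchrun-style launch so RANK/WORLD_SIZE are set in the environment;
# NCCL selects the transport (NVLink, InfiniBand/RoCE RDMA) underneath.
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(4096, 4096).cuda()        # toy model for illustration
    loss = model(torch.randn(8, 4096, device="cuda")).sum()
    loss.backward()
    sync_gradients(model)   # this collective is what the scale-out fabric must carry
    dist.destroy_process_group()
```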

InfiniBand remains the gold standard for AI training at scale, connecting more than 270 of the world's top supercomputers.³⁷ The adaptive routing, congestion control, and RDMA capabilities were designed specifically for synchronous, high-performance computing.

Ethernet is overtaking InfiniBand for scale-out deployments.³⁸ NVIDIA's Spectrum-X brings InfiniBand innovations to Ethernet, including telemetry-driven congestion control, adaptive load balancing, and direct data placement.³⁹ Large-scale systems with Spectrum-X have achieved 95% data throughput with zero application latency degradation, compared to only 60% throughput from standard Ethernet fabrics.⁴⁰

The hierarchical model combines these technologies appropriately. NVLink handles scale-up within the rack, delivering approximately 18 times the bandwidth of scale-out networking.⁴¹ InfiniBand or Ethernet handles scale-out between racks, providing reach across thousands of nodes. Each GPU tray in GB200 NVL72 systems includes 800 gigabits per second RDMA NICs for inter-rack communication.⁴²
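A rough way to see what the bandwidth hierarchy means in practice is to estimate the time for a single gradient all-reduce at each layer. The sketch below uses the standard ring all-reduce volume of 2(N-1)/N times the payload per GPU and ignores latency, SHARP offload, and overlap with compute; the 70-billion-parameter payload is a hypothetical example, and the 100 GB/s figure is simply the 800 Gb/s NIC rate converted to bytes.

```python
# Rough per-step all-reduce time at each layer of the bandwidth hierarchy.
# Standard ring all-reduce moves 2 * (N - 1) / N * payload bytes per GPU;
# latency, SHARP in-network reduction, and compute overlap are ignored.

def allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    volume_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return volume_gb / link_gb_per_s

GRADIENT_GB = 140.0   # hypothetical: ~70B parameters in BF16
N_GPUS = 72

print(f"NVLink 5 (1,800 GB/s):    {allreduce_seconds(GRADIENT_GB, N_GPUS, 1800):.2f} s")
print(f"800G RDMA NIC (100 GB/s): {allreduce_seconds(GRADIENT_GB, N_GPUS, 100):.2f} s")
```

The ratio between the two results tracks the roughly 18x bandwidth gap between the scale-up and scale-out layers.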

576-GPU domains and SuperPOD architecture

The NVLink Switch enables 576 fully connected GPUs in a non-blocking compute fabric.⁴³ Eight GB200 NVL72 racks form a SuperPOD, creating a supernode of 576 GPUs with over 1 petabyte per second total bandwidth and 240 terabytes of fast memory.⁴⁴

DGX SuperPOD is built on scalable units (SU), each containing eight DGX GB200 systems.⁴⁵ The modular design enables rapid deployment of SuperPODs at any scale. The reference architecture includes specifications for InfiniBand, NVLink network, Ethernet fabric topologies, storage systems, rack layouts, and wiring.⁴⁶

The 576-GPU domain maintains the fully connected NVLink topology across all racks in the SuperPOD.⁴⁷ Any GPU can communicate with any other at 1.8 terabytes per second without traversing scale-out networking.⁴⁸ The domain size matches the requirements of the largest foundation models currently in training.
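The SuperPOD figures follow the same arithmetic: eight NVL72 racks yield 576 GPUs, and 576 GPUs at 1.8 TB/s each comes to roughly 1.04 PB/s, consistent with the "over 1 petabyte per second" figure. A short check:

```python
# The 576-GPU SuperPOD figures follow from the per-rack numbers quoted above.
RACKS = 8
GPUS_PER_RACK = 72
PER_GPU_TB_PER_S = 1.8

gpus = RACKS * GPUS_PER_RACK                          # 576 GPUs
aggregate_pb_per_s = gpus * PER_GPU_TB_PER_S / 1000   # ~1.04 PB/s
print(gpus, round(aggregate_pb_per_s, 2))
```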

SuperPOD deployment requires on-premises installation.⁴⁹ Customers own and manage the hardware within their data centers or co-located commercial facilities. The architecture mirrors NVIDIA's internal research and development systems, meaning infrastructure software, applications, and support are tested on identical configurations.⁵⁰

Cloud providers Microsoft Azure, Oracle Cloud, and CoreWeave committed to adopting the X800 switch platforms (Quantum-X800 InfiniBand and Spectrum-X800 Ethernet) when available in 2025.⁵¹ Cloud deployments extend SuperPOD capabilities to organizations that cannot justify dedicated on-premises infrastructure.

Enterprise Kubernetes deployment

Multi-Node NVLink (MNNVL) systems require specialized Kubernetes configuration.⁵² Kubernetes does not natively recognize NVIDIA's MNNVL architecture, making workload management and scheduling more complex than standard GPU deployments.⁵³

Deployment requirements include Kubernetes 1.32 or later and NVIDIA GPU Operator version 25.3 or above.⁵⁴ The GPU Operator version must include the Dynamic Resource Allocation (DRA) driver, which provides support for GB200 accelerated networking resources and the ComputeDomain feature.⁵⁵ NVIDIA Network Operator handles network configuration.

The IMEX service supports GPU memory export and import across OS domains in NVLink multi-node deployments.⁵⁶ The service enables NVLink peer-to-peer communication and shared memory operations across the domain.

When distributed workloads target MNNVL node pools, the platform creates a ComputeDomain Custom Resource Definition (CRD) to manage NVLink domain assignments.⁵⁷ A reference to the ComputeDomain automatically attaches to workload specifications as a resource claim, allowing the scheduler to bind workloads to specific NVLink domains.⁵⁸

Pod affinity rules use the MNNVL label key (nvidia.com/gpu.clique) as the topology key.⁵⁹ The configuration ensures that pods within distributed workloads land on nodes with NVLink interconnects, maintaining the topology required for performance.⁶⁰
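As an illustration of that affinity rule, the sketch below builds a pod specification with the official Kubernetes Python client. Only the nvidia.com/gpu.clique topology key comes from the documentation cited above; the workload label, container image, and GPU count are placeholder assumptions.

```python
# Sketch of a pod-affinity rule keyed on the MNNVL clique label, using the
# official Kubernetes Python client. The app label, image tag, and GPU count
# are illustrative assumptions; only the topology key comes from NVIDIA's docs.
from kubernetes import client

affinity = client.V1Affinity(
    pod_affinity=client.V1PodAffinity(
        required_during_scheduling_ignored_during_execution=[
            client.V1PodAffinityTerm(
                label_selector=client.V1LabelSelector(
                    match_labels={"app": "trainer"}      # assumed workload label
                ),
                topology_key="nvidia.com/gpu.clique",    # MNNVL clique label key
            )
        ]
    )
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="trainer-0", labels={"app": "trainer"}),
    spec=client.V1PodSpec(
        affinity=affinity,
        containers=[
            client.V1Container(
                name="trainer",
                image="nvcr.io/nvidia/pytorch:24.12-py3",   # assumed image tag
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "4"}          # assumed GPU count
                ),
            )
        ],
    ),
)
# Submission requires cluster credentials, e.g.:
# client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```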

Infrastructure planning considerations

Organizations evaluating NVLink infrastructure should consider workload characteristics first. Tensor parallelism across large models benefits directly from NVLink bandwidth. Data parallelism across many smaller models may not require NVLink's capabilities and can achieve adequate performance with scale-out networking alone.

The GB200 NVL72 represents a significant infrastructure commitment. Liquid cooling requirements, rack-scale integration, and specialized networking exceed the complexity of air-cooled, server-centric architectures. Organizations should validate that workloads require these capabilities before committing.

Power and cooling infrastructure must accommodate liquid cooling from initial deployment. The GB200 NVL72 cannot operate with air cooling. Retrofitting facilities for liquid cooling after deployment is expensive and disruptive.

Network planning should address both scale-up and scale-out requirements. The NVLink fabric handles intra-rack communication, but inter-rack communication requires separate InfiniBand or Ethernet infrastructure. Budget and plan for both layers.

Kubernetes expertise specific to MNNVL becomes essential for production deployments. Standard Kubernetes GPU scheduling does not recognize NVLink topology. Organizations need teams capable of configuring and managing the specialized resources.

Cloud-based access provides evaluation opportunities without capital commitment. Azure, Oracle Cloud, and CoreWeave offerings enable workload testing on GB200 infrastructure before deciding on dedicated deployments.

The scale-up networking capabilities of NVLink and NVSwitch enable AI workloads that scale-out networks cannot support efficiently. Organizations training or serving the largest models require understanding when these capabilities justify their infrastructure complexity. The hierarchical model—NVLink for scale-up, InfiniBand or Ethernet for scale-out—represents the blueprint for state-of-the-art AI infrastructure.


References

  1. NVIDIA. "NVLink & NVSwitch for Advanced Multi-GPU Communication." 2025. https://www.nvidia.com/en-us/data-center/nvlink/

  2. NVIDIA. "GB200 NVL72." 2025. https://www.nvidia.com/en-us/data-center/gb200-nvl72/

  3. NVIDIA. "NVLink & NVSwitch for Advanced Multi-GPU Communication."

  4. AMAX. "Fifth-Generation NVIDIA NVLink." 2025. https://www.amax.com/fifth-generation-nvidia-nvlink/

  5. AMAX. "Fifth-Generation NVIDIA NVLink."

  6. AMAX. "Fifth-Generation NVIDIA NVLink."

  7. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution: Topology and Rates." 2025. https://www.fibermall.com/blog/nvidia-nvlink-and-nvswitch-evolution.htm

  8. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  9. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  10. Face of IT. "NVLink vs. InfiniBand vs. NVSwitch: The 2025 Guide for AI & HPC." 2025. https://www.faceofit.com/nvlink-vs-infiniband-vs-nvswitch/

  11. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  12. NVIDIA Technical Overview. "NVIDIA NVSwitch: The World's Highest-Bandwidth On-Node Switch." https://images.nvidia.com/content/pdf/nvswitch-technical-overview.pdf

  13. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  14. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  15. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  16. NVIDIA Developer Blog. "Upgrading Multi-GPU Interconnectivity with the Third-Generation NVIDIA NVSwitch." 2022. https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch/

  17. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  18. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  19. FiberMall. "Understanding Nvidia's NvLink and NvSwitch Evolution."

  20. NVIDIA. "NVLink & NVSwitch for Advanced Multi-GPU Communication."

  21. NVIDIA. "GB200 NVL72."

  22. NVIDIA. "GB200 NVL72."

  23. Nebius. "Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72." 2025. https://nebius.com/blog/posts/leveraging-nvidia-gb200-nvl72-gpu-interconnect

  24. NVIDIA DGX SuperPOD Documentation. "Key Components of the DGX SuperPOD." 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/dgx-superpod-components.html

  25. NVIDIA DGX SuperPOD Documentation. "Key Components of the DGX SuperPOD."

  26. NADDOD Blog. "NVIDIA GB200 Interconnect Architecture Analysis: NVLink, InfiniBand, and Future Trends." 2025. https://www.naddod.com/blog/nvidia-gb200-interconnect-architecture-analysis-nvlink-infiniband-and-future-trends

  27. Nebius. "Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72."

  28. Nebius. "Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72."

  29. Nebius. "Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72."

  30. Nebius. "Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72."

  31. NVIDIA Developer Blog. "NVIDIA GB200 NVL72 Delivers Trillion-Parameter LLM Training and Real-Time Inference." 2024. https://developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/

  32. Network Switch. "Scale-Up vs. Scale-Out in AI Infrastructure: Key Differences 2025." 2025. https://network-switch.com/blogs/networking/scale-up-vs-scale-out-in-ai-infrastructure

  33. Face of IT. "NVLink vs. InfiniBand vs. NVSwitch."

  34. Face of IT. "NVLink vs. InfiniBand vs. NVSwitch."

  35. ARC Compute. "InfiniBand vs. Ethernet: Choosing the Right Network Fabric for AI Clusters." 2025. https://www.arccompute.io/arc-blog/infiniband-vs-ethernet-choosing-the-right-network-fabric-for-ai-clusters

  36. RunPod. "Do I need InfiniBand for distributed AI training?" 2025. https://www.runpod.io/articles/guides/infiniband-for-distributed-ai-training

  37. ARC Compute. "InfiniBand vs. Ethernet."

  38. LightCounting. "AI Scale-up Switches Overtake Ethernet and InfiniBand." April 2025. https://www.lightcounting.com/newsletter/en/april-2025-ethernet-infiniband-and-optical-switches-for-cloud-datacenters-335

  39. Face of IT. "NVLink vs. InfiniBand vs. NVSwitch."

  40. Face of IT. "NVLink vs. InfiniBand vs. NVSwitch."

  41. NADDOD Blog. "NVIDIA GB200 Interconnect Architecture Analysis."

  42. NVIDIA DGX SuperPOD Documentation. "Network Fabrics." 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/network-fabrics.html

  43. NVIDIA. "NVLink & NVSwitch for Advanced Multi-GPU Communication."

  44. AMAX. "Fifth-Generation NVIDIA NVLink."

  45. NVIDIA DGX SuperPOD Documentation. "Key Components of the DGX SuperPOD."

  46. NVIDIA DGX SuperPOD Documentation. "Key Components of the DGX SuperPOD."

  47. FiberMall. "How does NVIDIA Build the AI Supercomputer SuperPod?" 2025. https://www.fibermall.com/blog/supercomputer-superpod.htm

  48. FiberMall. "How does NVIDIA Build the AI Supercomputer SuperPod?"

  49. NVIDIA DGX SuperPOD Documentation. "Abstract." 2025. https://docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-gb200/latest/abstract.html

  50. NVIDIA DGX SuperPOD Documentation. "Abstract."

  51. HPC Wire. "NVLink: Faster Interconnects and Switches to Help Relieve Data Bottlenecks." March 2024. https://www.hpcwire.com/2024/03/25/nvlink-faster-interconnects-and-switches-to-help-relieve-data-bottlenecks/

  52. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains." 2025. https://run-ai-docs.nvidia.com/saas/platform-management/aiinitiatives/resources/using-gb200-nvl72-and-multi-node-nvlink-domains

  53. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."

  54. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."

  55. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."

  56. NVIDIA Documentation. "NVIDIA Multi-Node NVLink Systems." 2025. https://docs.nvidia.com/multi-node-nvlink-systems/index.html

  57. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."

  58. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."

  59. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."

  60. Run:ai Documentation. "Using GB200 NVL72 and Multi-Node NVLink Domains."


Key takeaways

For infrastructure architects:
- NVLink 5 delivers 1.8TB/s per GPU (18 links × 100GB/s), 14x PCIe Gen5 bandwidth; GB200 NVL72 provides 130TB/s aggregate
- Scale-up (NVLink within the rack) delivers roughly 18x the bandwidth of scale-out (InfiniBand/Ethernet between racks)
- NVSwitch evolution: 900GB/s per chip (Gen 1, DGX-2/V100) → 600GB/s per GPU (Gen 2, A100) → 25.6Tb/s per chip (Gen 3, H100) → 14.4TB/s per 144-port switch (Gen 4, Blackwell)

For procurement teams:
- GB200 NVL72: 36 Grace CPUs, 72 Blackwell GPUs, 9 NVLink switch trays; liquid cooling mandatory
- 576-GPU SuperPOD (8 NVL72 racks) achieves 1PB/s bandwidth and 240TB fast memory
- Cloud options: Azure, Oracle Cloud, and CoreWeave committed to X800 architecture availability in 2025

For platform teams:
- Kubernetes MNNVL deployment requires K8s 1.32+, GPU Operator 25.3+ with the DRA driver, and Network Operator
- ComputeDomain CRD manages NVLink domain assignments; pod affinity uses the nvidia.com/gpu.clique topology key
- IMEX service enables GPU memory export/import across OS domains for peer-to-peer communication

For operations teams:
- No direct GPU-to-GPU connectivity within a compute tray; all communication routes through the external NVSwitch fabric
- Each GPU tray includes 800Gb/s RDMA NICs for inter-rack communication; network planning must address both scale-up and scale-out
- Liquid cooling required from initial deployment; retrofitting air-cooled facilities is expensive and disruptive


