xAI's Memphis Colossus: anatomy of a 100,000-GPU cluster
December 2025 update: Colossus now comprises 150,000 H100, 50,000 H200, and 30,000 GB200 GPUs, making it the world's largest single-coherent AI training cluster. The initial 100,000 GPUs went up in 122 days, and the system doubled in 92 more. xAI is planning an expansion to 1 million GPUs. The facility draws 250 MW from the Memphis utility grid, and its Spectrum-X Ethernet fabric achieves 95% throughput versus roughly 60% on traditional Ethernet at this scale.
xAI built the Colossus cluster in 122 days, deploying 100,000 NVIDIA H100 GPUs in a former appliance factory in Memphis, Tennessee.¹ It then doubled the system to 200,000 GPUs in 92 additional days.² The cluster currently comprises 150,000 H100 GPUs, 50,000 H200 GPUs, and 30,000 GB200 GPUs, making it the largest fully operational, single-coherent AI training cluster in the world.³ xAI plans to expand to 1 million GPUs.⁴ The project demonstrates what aggressive infrastructure deployment looks like when an organization prioritizes speed over conventional planning timelines.
The Colossus project offers lessons for any organization building AI infrastructure at scale. The decisions around power, cooling, networking, and facility selection reveal how constraints can be overcome when traditional approaches prove too slow. The tradeoffs also reveal risks that more methodical deployments avoid.
Construction timeline and approach
Musk received initial quotes of 18 to 24 months for data center construction.⁵ Rejecting that timeline, xAI found the former Electrolux factory in Memphis, which the appliance maker had opened in 2012 and closed in 2020.⁶ The abandoned facility offered considerable warehouse space and 15 megawatts of initial industrial power.⁷
Supermicro CEO Charles Liang confirmed his company teamed with xAI to build the gargantuan Colossus data center in 122 days.⁸ Both Dell Technologies and Supermicro partnered with xAI on construction.⁹ The compressed timeline required parallel workstreams across facility preparation, power infrastructure, cooling systems, and compute deployment.
The 100,000-GPU cluster uses HGX servers containing eight GPUs each, housed in Supermicro liquid-cooled racks with 64 GPUs per rack.¹⁰ The total deployment comprises 1,500 GPU racks.¹¹ The rack density required liquid cooling from inception, with Supermicro's 4U liquid-cooled systems providing thermal management.¹²
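A quick back-of-envelope check of that rack math, assuming the 8-GPU HGX servers and 8-servers-per-rack layout described above (the rounding is mine):

```python
# Back-of-envelope rack math for the initial 100K-GPU build.
# 8 GPUs per HGX server and 8 servers per liquid-cooled rack,
# per the Supermicro/ServeTheHome descriptions cited above.
import math

total_gpus = 100_000
gpus_per_server = 8                                  # one HGX board per server
servers_per_rack = 8                                 # 64 GPUs per rack
gpus_per_rack = gpus_per_server * servers_per_rack

servers = math.ceil(total_gpus / gpus_per_server)    # 12,500 servers
racks = math.ceil(total_gpus / gpus_per_rack)        # 1,563 racks

print(f"{servers:,} servers across {racks:,} racks")
# ~1,563 racks, close to the roughly 1,500 GPU racks cited above.
```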
Three months after the initial deployment, xAI announced expansion to 200,000 GPUs with plans to continue scaling to 1 million.¹³ The expansion demonstrated that the infrastructure architecture could accommodate growth without fundamental redesign.
Power infrastructure at unprecedented scale
The Colossus facility currently draws approximately 250 megawatts, up from the initial 150-megawatt configuration.¹⁴ xAI installed 35 gas turbines capable of producing 420 megawatts of power alongside Tesla Megapack battery systems.¹⁵ The hybrid approach provides both baseload power and grid independence.
xAI designed and built its first Memphis Light, Gas and Water (MLGW) substation in 97 days, completing a 150-megawatt facility that would normally take 2.5 years.¹⁶ The acceleration required working closely with the utility while simultaneously deploying temporary power solutions.
The company deployed 208 Tesla Megapacks to power the supercomputer, initially isolating it from the MLGW grid.¹⁷ The Megapacks store large amounts of electricity, providing backup during grid disruptions and enabling operations before permanent utility connections were completed.
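How long such a battery fleet can carry the site is easy to estimate. A rough sketch, assuming about 3.9 MWh per unit (the Megapack 2 XL rating; xAI has not published which variant or what usable capacity it deploys):

```python
# Rough ride-through estimate for 208 Megapacks at the initial load.
# Assumption: ~3.9 MWh per unit (Megapack 2 XL rating); the exact
# variant and usable capacity are not published.
megapacks = 208
mwh_per_pack = 3.9                     # assumed per-unit energy
fleet_mwh = megapacks * mwh_per_pack   # ~811 MWh

load_mw = 150                          # initial facility load
hours = fleet_mwh / load_mw            # ~5.4 hours at full load

print(f"~{fleet_mwh:.0f} MWh -> ~{hours:.1f} hours at {load_mw} MW")
```

In practice the batteries appear intended to bridge short grid disruptions and smooth the sharp load swings of synchronized training steps rather than carry the full site for hours.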
Solaris Energy Infrastructure owns a fleet of 600 megawatts of gas turbines, with approximately 400 megawatts currently serving xAI.¹⁸ xAI represents 67% of Solaris's 1,700-megawatt order book, totaling 1,140 megawatts.¹⁹ Solaris expects to have over 1.1 gigawatts of fully operating turbines for xAI by Q2 2027.²⁰
The Colossus 2 expansion at the Tulane Road site includes at least 110,000 NVIDIA GB200 GPUs carrying a power load around 170 megawatts.²¹ Additional Megapacks and turbine capacity support the expanded footprint.
xAI received a permit for gas-burning turbines to power the supercomputer.²² The permit expires in 2027, by which time xAI intends to rely on multiple power sources, including two MLGW substations financed and built on the Colossus campus.²³ xAI also plans to break ground on a 500-acre solar farm near the site.²⁴
Cooling systems and water infrastructure
From the start, xAI trucked in water and recycled it through an internal closed-loop system to cool the supercomputer.²⁵ The unconventional approach enabled operations before permanent water infrastructure was in place. xAI committed to building an $80 million wastewater recycling facility to address long-term water needs.²⁶
The company plans the world's largest ceramic membrane bioreactor wastewater recycling plant.²⁷ Once complete, the facility will protect an estimated 4.745 billion gallons of aquifer water.²⁸ A massive graywater cooling tower under construction will pipe cooled recycled water into Colossus from the nearby graywater plant.²⁹
Colossus 2 uses a hybrid cooling approach: approximately half of the cooling comes from xAI's graywater facility while the other half uses air cooling.³⁰ By August 2025, 119 air-cooled chillers provided roughly 200 megawatts of cooling capacity, enough for approximately 110,000 GB200 GPUs in NVL72 racks.³¹
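Dividing those figures out gives a feel for the per-unit numbers; the per-GPU allocation below is my arithmetic over the cited totals, not a published specification:

```python
# Per-unit cooling arithmetic from the cited Colossus 2 figures.
chillers = 119
total_cooling_mw = 200
gpus_served = 110_000

mw_per_chiller = total_cooling_mw / chillers          # ~1.68 MW each
kw_per_gpu = total_cooling_mw * 1_000 / gpus_served   # ~1.8 kW per GPU

print(f"~{mw_per_chiller:.2f} MW per chiller, "
      f"~{kw_per_gpu:.1f} kW of cooling per GPU")
# The ~1.8 kW/GPU figure bundles each GPU's share of host CPUs,
# networking, and facility overhead, not just the bare chip TDP.
```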
During the initial construction phase, xAI leased generators and approximately a quarter of the US mobile cooling capacity to kickstart operations quickly.³² The aggressive procurement of temporary infrastructure enabled the compressed timeline while permanent systems were built.
Spectrum-X Ethernet networking
Unlike most AI training clusters, which use InfiniBand, xAI's Colossus uses NVIDIA's Spectrum-X Ethernet platform for its RDMA network.³³ The choice demonstrates that Ethernet can support the largest AI training clusters when properly configured.
Colossus uses the 51.2-terabit-per-second Spectrum SN5600 switch, which provides 64 ports of 800-gigabit Ethernet in a 2U form factor.³⁴ Individual nodes use NVIDIA BlueField-3 SuperNICs, with a single 400-gigabit connection to each GPU.³⁵
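Under simple leaf-spine assumptions, those port counts imply a GPU-facing tier of well over a thousand switches. A sizing sketch for the initial 100,000-endpoint fabric (the non-blocking split and the 2 x 400G breakout per 800G port are my assumptions; xAI has not published the exact topology):

```python
# Rough leaf-tier sizing for 100,000 GPU endpoints at 400G each.
# Assumptions (not published): each 800G SN5600 port breaks out into
# 2 x 400G links, and each leaf splits its links evenly between
# GPU-facing downlinks and spine-facing uplinks (non-blocking, 1:1).
import math

endpoints = 100_000
ports_800g = 64                    # SN5600: 64 x 800GbE in 2U
links_400g = ports_800g * 2        # 128 x 400G links per switch

down_per_leaf = links_400g // 2    # 64 GPU-facing links per leaf
leaves = math.ceil(endpoints / down_per_leaf)

print(f"{leaves:,} leaf switches for the GPU tier alone")
# ~1,563 leaves; spine and core tiers add more, which is why congestion
# behavior across all three tiers of the fabric matters so much.
```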
The network achieved zero application latency degradation or packet loss due to flow collisions across all three tiers of the fabric.³⁶ The system maintained 95% data throughput enabled by Spectrum-X congestion control.³⁷ Standard Ethernet typically delivers only 60% throughput at this scale due to thousands of flow collisions.³⁸
Traditional Ethernet networks struggle with incast problems when thousands of GPUs communicate simultaneously.³⁹ InfiniBand traditionally solved this with built-in Priority Flow Control and hardware-level congestion management.⁴⁰ Spectrum-X achieves similar results using RoCE v2 with enhanced congestion control mechanisms.⁴¹
The Ethernet approach provides cost benefits and flexibility compared to InfiniBand while maintaining performance. Spectrum-X features including adaptive routing with Direct Data Placement technology, congestion control, and enhanced AI fabric visibility enable InfiniBand-like performance on Ethernet infrastructure.⁴²
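The 95-versus-60-percent gap compounds directly into training step time. A toy ring all-reduce estimate (the gradient size, group size, and per-GPU 400G rate are illustrative assumptions, not xAI figures):

```python
# Toy comparison of effective all-reduce time at 95% vs 60% throughput.
# Gradient size and group size are illustrative, not xAI figures.
def allreduce_seconds(nbytes: float, gpus: int, nic_gbps: float,
                      efficiency: float) -> float:
    # A ring all-reduce moves ~2*(N-1)/N of the buffer over each link.
    traffic = 2 * (gpus - 1) / gpus * nbytes
    rate_bytes_per_s = nic_gbps * efficiency * 1e9 / 8
    return traffic / rate_bytes_per_s

grads = 200e9            # 200 GB of gradients (~100B fp16 parameters)
gpus, nic_gbps = 1024, 400

for eff in (0.95, 0.60):
    t = allreduce_seconds(grads, gpus, nic_gbps, eff)
    print(f"{eff:.0%} efficiency -> {t:.2f} s per all-reduce")
# 95% -> ~8.4 s, 60% -> ~13.3 s: the same hardware spends nearly 60%
# longer in communication, the gap congestion control exists to close.
```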
Scale comparison
Colossus at 200,000 GPUs exceeds other major supercomputers' GPU counts by substantial margins.⁴³ Oracle's zettascale AI supercomputer contains 131,072 NVIDIA GPUs.⁴⁴ Lawrence Livermore National Laboratory's El Capitan has 44,544 GPUs.⁴⁵ Oak Ridge National Laboratory's Frontier has 37,632 GPUs.⁴⁶
According to xAI's specifications, Colossus achieves total memory bandwidth of 194 petabytes per second with storage capacity exceeding one exabyte.⁴⁷ The memory bandwidth enables the collective operations that AI training requires across hundreds of thousands of GPUs.
The cluster trains xAI's Grok chatbot and provides computing support to X and other Musk ventures including SpaceX.⁴⁸ The multi-purpose utilization justifies the infrastructure investment across multiple business lines.
Colossus 2 expansion
xAI kicked off the Colossus 2 project on March 7, 2025, acquiring a 1-million-square-foot warehouse in Memphis plus two adjacent sites totaling 100 acres.⁴⁹ The Tulane Road site will host the expanded GPU fleet.
The expansion targets 350,000 GPUs with the world's largest deployment of Tesla Megapack batteries for backup power during high grid loads.⁵⁰ The site will feature 60 to 70 Megapacks alongside the GPU infrastructure.⁵¹
The Memphis Chamber of Commerce claims xAI intends to expand to 1 million GPUs total.⁵² Achieving that scale requires continued power infrastructure development beyond current capacity. The 1.1 gigawatts Solaris plans for 2027 would support approximately half a million high-power GPUs at current density levels.
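The half-million figure follows from simple division. A sketch, assuming an all-in budget of about 2.2 kW of facility power per GPU (chip plus host share, networking, and cooling overhead; an illustrative assumption, not a published xAI number):

```python
# How far 1.1 GW stretches at an assumed all-in power budget per GPU.
# The 2.2 kW/GPU figure (chip + host share + cooling overhead) is an
# illustrative assumption, not a published xAI number.
site_power_mw = 1_100
kw_per_gpu_all_in = 2.2

gpus_supported = site_power_mw * 1_000 / kw_per_gpu_all_in
print(f"~{gpus_supported:,.0f} GPUs")   # ~500,000

# Reaching 1 million GPUs at this density would need roughly 2.2 GW,
# which is why power, not compute supply, gates the expansion.
```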
Infrastructure lessons
The Colossus project demonstrates several approaches that accelerate AI infrastructure deployment.
Facility reuse can compress timelines dramatically. Finding an existing industrial facility with power infrastructure in place eliminated construction time that new builds require. Organizations with access to decommissioned industrial facilities may find opportunities for rapid AI infrastructure deployment.
Temporary infrastructure enables parallel paths. Leasing generators, mobile cooling, and trucking water allowed operations to begin while permanent infrastructure was built. The cost premium for temporary solutions may prove worthwhile when time-to-operation determines competitive position.
Ethernet can support the largest clusters. The Spectrum-X deployment proves that InfiniBand is not required for massive-scale AI training. Organizations with Ethernet expertise and infrastructure may not need to adopt InfiniBand for even the largest deployments.
Power remains the primary constraint. Despite creative solutions including battery storage, gas turbines, and accelerated substation construction, power availability limited the speed and scale of deployment. Organizations planning large AI clusters should secure power capacity first.
The tradeoffs include regulatory challenges, community relations issues, and technical risks from compressed timelines. xAI's permit for gas turbines expires in 2027, creating transition requirements.⁵³ Local officials expressed concerns about limited visibility into xAI's operations.⁵⁴ The speed that enables competitive advantage may create technical debt that slower deployments avoid.
Quick reference: Colossus specifications
| Specification | Value |
|---|---|
| Total GPUs | 230,000 (150K H100 + 50K H200 + 30K GB200) |
| Build time | 122 days (Phase 1), 92 days (Phase 2) |
| Power consumption | ~250 MW (current) |
| Power infrastructure | 35 gas turbines (420 MW), 208 Tesla Megapacks |
| Networking | NVIDIA Spectrum-X 800G Ethernet |
| Storage | >1 exabyte |
| Memory bandwidth | 194 PB/s |
| Rack configuration | 64 GPUs per rack, 1,500 racks |
| Cooling | Liquid cooling + graywater recycling |
| Expansion target | 1 million GPUs |
Key takeaways
For infrastructure leaders:
- Traditional data center quotes ran 18 to 24 months; xAI delivered in 122 days using facility reuse
- Temporary infrastructure (leased generators, mobile cooling, trucked water) enables parallel paths
- Power remains the primary constraint; secure capacity before GPU procurement
- Spectrum-X Ethernet proved viable at 200K-GPU scale, challenging the necessity of InfiniBand

For facilities teams:
- Decommissioned industrial facilities offer rapid deployment opportunities
- 250 MW requires multiple power sources: gas turbines, batteries, utility substations
- Graywater recycling addresses water concerns at scale; the $80M facility protects 4.7B gallons of aquifer water
- 119 air-cooled chillers provide ~200 MW of cooling capacity

For strategic planning:
- Speed vs. sustainability tradeoff: gas turbine permits expire in 2027
- Compressed timelines create technical debt that methodical deployments avoid
- Multi-purpose utilization (Grok, X, SpaceX) justifies the infrastructure investment
- The 1-million-GPU target requires 1.1+ GW of power, beyond current grid capacity
The Colossus project establishes a new baseline for what aggressive AI infrastructure deployment can achieve. Organizations should understand what the fastest possible path looks like, even if they choose more measured approaches for their own deployments.
References
1. Supermicro. "Inside the 100K GPU xAI Colossus Cluster." 2024. https://www.supermicro.com/CaseStudies/Success_Story_xAI_Colossus_Cluster.pdf
2. HPCwire. "Colossus AI Hits 200,000 GPUs as Musk Ramps Up AI Ambitions." May 2025. https://www.hpcwire.com/2025/05/13/colossus-ai-hits-200000-gpus-as-musk-ramps-up-ai-ambitions/
3. SemiAnalysis. "xAI's Colossus 2 - First Gigawatt Datacenter In The World." September 2025. https://semianalysis.com/2025/09/16/xais-colossus-2-first-gigawatt-datacenter/
4. Data Center Dynamics. "Elon Musk's xAI targets one million GPUs for Colossus supercomputer in Memphis." 2025. https://www.datacenterdynamics.com/en/news/xai-elon-musk-memphis-colossus-gpu/
5. Wikipedia. "Colossus (supercomputer)." 2025. https://en.wikipedia.org/wiki/Colossus_(supercomputer)
6. Wikipedia. "Colossus (supercomputer)."
7. R&D World Online. "How xAI turned a factory shell into an AI 'Colossus' for Grok 3." 2025. https://www.rdworldonline.com/how-xai-turned-a-factory-shell-into-an-ai-colossus-to-power-grok-3-and-beyond/
8. Fortune. "Super Micro CEO Charles Liang said he teamed up with Elon Musk's xAI to build the gargantuan Colossus data center in just 122 days." March 2025. https://fortune.com/2025/03/14/super-micro-ceo-charles-liang-elon-musk-xai-grok-nvidia-server-chips/
9. Wikipedia. "Colossus (supercomputer)."
10. ServeTheHome. "Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk." 2024. https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/
11. NVIDIA Newsroom. "NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI." 2024. https://nvidianews.nvidia.com/news/spectrum-x-ethernet-networking-xai-colossus
12. Built In. "What Is the xAI Supercomputer (Colossus)?" 2025. https://builtin.com/artificial-intelligence/xai-supercomputer-colossus
13. Wikipedia. "Colossus (supercomputer)."
14. R&D World Online. "How xAI turned a factory shell into an AI 'Colossus.'"
15. Infinity Turbine. "The Xai Colossus MegaCluster in Memphis TN Featuring 100000 Nvidia H100." 2025. https://infinityturbine.com/xai-colossus-cluster-memphis-tn-nvida-h100-by-infinity-turbine.html
16. Greater Memphis Chamber. "xAI." 2025. https://memphischamber.com/economic-development/xai/
17. Infinity Turbine. "The Xai Colossus MegaCluster."
18. SemiAnalysis. "xAI's Colossus 2."
19. SemiAnalysis. "xAI's Colossus 2."
20. SemiAnalysis. "xAI's Colossus 2."
21. SemiAnalysis. "xAI's Colossus 2."
22. CNBC. "Musk's xAI scores permit for gas-burning turbines to power Grok supercomputer in Memphis." July 2025. https://www.cnbc.com/2025/07/03/musks-xai-gets-permit-for-turbines-to-power-supercomputer-in-memphis.html
23. Local Memphis. "xAI officials ensure stable power for Shelby County as they prepare to fire up Colossus 2." 2025. https://www.localmemphis.com/article/news/local/xai-says-stable-power-for-shelby-county-colossus/522-54334b93-6d25-49c3-89b8-563437797adc
24. Local Memphis. "xAI officials ensure stable power for Shelby County."
25. Greater Memphis Chamber. "xAI."
26. Greater Memphis Chamber. "xAI."
27. Greater Memphis Chamber. "xAI."
28. Greater Memphis Chamber. "xAI."
29. Action News 5. "Inside Look: Action News 5's Joe Birch tours growing xAI data center." August 2025. https://www.actionnews5.com/2025/08/18/inside-look-action-news-5s-joe-birch-tours-growing-xai-data-center/
30. SemiAnalysis. "xAI's Colossus 2."
31. SemiAnalysis. "xAI's Colossus 2."
32. R&D World Online. "How xAI turned a factory shell into an AI 'Colossus.'"
33. Data Center Dynamics. "xAI to double Colossus compute capacity, reveals cluster uses Nvidia Spectrum-X ethernet." 2024. https://www.datacenterdynamics.com/en/news/xai-to-double-colossus-compute-capacity-reveals-cluster-uses-nvidia-spectrum-x-ethernet/
34. The Register. "xAI's 100,000 H100 Colossus is glued together using Ethernet." October 2024. https://www.theregister.com/2024/10/29/xai_colossus_networking/
35. The Register. "xAI's 100,000 H100 Colossus."
36. NVIDIA Newsroom. "NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer."
37. NVIDIA Newsroom. "NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer."
38. NVIDIA Newsroom. "NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer."
39. Storage Review. "NVIDIA Spectrum-X Networking Powers xAI's Colossus Supercomputer." 2024. https://www.storagereview.com/news/nvidia-spectrum-x-networking-powers-xais-colossus-supercomputer
40. Storage Review. "NVIDIA Spectrum-X Networking Powers xAI's Colossus Supercomputer."
41. Storage Review. "NVIDIA Spectrum-X Networking Powers xAI's Colossus Supercomputer."
42. NVIDIA Newsroom. "NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer."
43. Built In. "What Is the xAI Supercomputer (Colossus)?"
44. Built In. "What Is the xAI Supercomputer (Colossus)?"
45. Built In. "What Is the xAI Supercomputer (Colossus)?"
46. Built In. "What Is the xAI Supercomputer (Colossus)?"
47. xAI. "Colossus." 2025. https://x.ai/colossus/
48. Wikipedia. "Colossus (supercomputer)."
49. SemiAnalysis. "xAI's Colossus 2."
50. Baxtel. "xAI: Memphis Data Center." 2025. https://baxtel.com/data-center/xai-memphis
51. SemiAnalysis. "xAI's Colossus 2."
52. Data Center Dynamics. "Elon Musk's xAI targets one million GPUs."
53. CNBC. "Musk's xAI scores permit for gas-burning turbines."
54. Data Center Dynamics. "Councillors in the dark over Elon Musk's xAI Memphis data center." 2025. https://www.datacenterdynamics.com/en/news/we-dont-know-anything-councillors-in-the-dark-over-elon-musks-xai-memphis-data-center/