Key Facts
- 300,000 GPUs currently operational in Colossus 2.
- 1 million GPU target by end of 2026.
- $6 billion estimated hardware cost.
- 150 megawatts of power draw.
- Grok-3 training already underway.
xAI's Colossus 2 supercomputer in Memphis now operates 300,000 GPUs, according to Epoch AI. The cluster targets 1 million GPUs by the end of 2026, which would make it one of the largest AI training clusters ever built. Training runs for Grok-3 are already underway, though xAI has not disclosed performance metrics.
Key Takeaways
- xAI's Colossus 2 reaches 300,000 GPUs and targets 1 million by the end of 2026.
- The $6 billion cluster is already training Grok-3 and positions xAI to challenge OpenAI and Google.
Scale and Cost

The build-out cost an estimated $6 billion for hardware alone, with power and cooling demands of 150 megawatts. By chip count, this dwarfs competing clusters: Google's TPU v5p pods top out at 8,960 chips per pod, while Microsoft's planned 2027 cluster targets 100,000 GPUs. Colossus 2 achieves its density through direct liquid cooling and a custom InfiniBand fabric.
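The headline figures imply some rough per-GPU numbers. The sketch below is back-of-the-envelope arithmetic only. It assumes that the $6 billion estimate and the 150 MW figure both apply to the 300,000 GPUs currently installed, which the source does not state explicitly, and the function names are illustrative.

```python
# Back-of-the-envelope ratios from the figures quoted in this article.
# Assumption: the $6B hardware estimate and the 150 MW figure both refer
# to the 300,000 GPUs currently installed.

GPU_COUNT = 300_000
HARDWARE_COST_USD = 6_000_000_000
POWER_MW = 150
TPU_V5P_POD_CHIPS = 8_960


def cost_per_gpu(total_cost_usd: float, gpus: int) -> float:
    """Average hardware spend per GPU, in dollars."""
    return total_cost_usd / gpus


def watts_per_gpu(power_mw: float, gpus: int) -> float:
    """Average power per GPU, in watts (1 MW = 1,000,000 W)."""
    return power_mw * 1_000_000 / gpus


if __name__ == "__main__":
    print(f"Cost per GPU:  ${cost_per_gpu(HARDWARE_COST_USD, GPU_COUNT):,.0f}")
    print(f"Power per GPU: {watts_per_gpu(POWER_MW, GPU_COUNT):,.0f} W")
    # Chip-count ratio only; it says nothing about relative performance.
    print(f"TPU v5p pods by chip count: {GPU_COUNT / TPU_V5P_POD_CHIPS:.1f}")
```

Under these assumptions the figures work out to about $20,000 and 500 W per GPU. The pod comparison is by chip count alone and should not be read as a performance ratio between different accelerators.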
Competitive Implications
xAI is racing to close the compute gap with OpenAI and Google. The 300,000-GPU milestone gives it roughly 3x the raw FLOPS of OpenAI's current largest cluster, though how much of that translates into training throughput depends on interconnect topology and model architecture. xAI is betting that brute-force scaling still works for frontier models, even as competitors such as Google pursue efficiency through sparse mixture-of-experts (MoE) models and distillation. If Grok-3 matches GPT-5.5 on benchmarks such as SWE-Bench or MATH, it will validate the scaling-only approach. If not, the $6B hardware spend will look like a bet on a fading paradigm.
Infrastructure Challenges

Operating 300,000 GPUs requires solving reliability at unprecedented scale. Epoch AI notes that mean time between failures (MTBF) for GPU clusters drops non-linearly beyond 100,000 units. xAI has not published uptime data, but sources familiar with the build say redundancy is built into every rack, with hot-swappable power supplies and redundant network paths. Memphis was chosen for its low energy costs and proximity to a Tennessee Valley Authority (TVA) substation capable of supplying 200 MW.
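To see why reliability dominates at this scale, consider a deliberately simple model. It is an assumption of this sketch, not Epoch AI's analysis: if each GPU fails independently at a constant rate, the expected time between failures anywhere in the cluster is the per-GPU MTBF divided by the GPU count. The per-GPU figure below is hypothetical and chosen only for illustration.

```python
def cluster_mtbf_hours(per_gpu_mtbf_hours: float, gpu_count: int) -> float:
    """Expected hours between failures anywhere in the cluster.

    Assumes independent failures at a constant per-GPU rate, so the
    failure rates add up and the cluster MTBF is per-GPU MTBF / N.
    """
    if gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    return per_gpu_mtbf_hours / gpu_count


# Hypothetical per-GPU MTBF, for illustration only (not a published figure).
HYPOTHETICAL_GPU_MTBF_HOURS = 50_000.0

if __name__ == "__main__":
    for n in (1_000, 100_000, 300_000, 1_000_000):
        minutes = cluster_mtbf_hours(HYPOTHETICAL_GPU_MTBF_HOURS, n) * 60
        print(f"{n:>9,} GPUs: one failure every {minutes:,.0f} minutes on average")
```

With this hypothetical input, a 300,000-GPU cluster would see a failure about every 10 minutes on average. The model ignores correlated failures such as a shared power or network fault, so it does not capture any additional degradation beyond a given cluster size.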
What to Watch
Watch for Grok-3 benchmark scores on SWE-Bench and MATH by Q3 2026. If xAI publishes scaling-law data, it will show whether brute-force GPU scaling still beats algorithmic efficiency.
Source: news.google.com









