AI Data Centers.
Build them. Operate them. Understand them.
A new generation of facilities — 100MW+ campuses, liquid-cooled GPU racks, InfiniBand fabrics — powers every frontier AI model. This hub is everything we know about that infrastructure: a 12-lesson technical curriculum, a 130+ term glossary, curated courses, and live news from the people actually building it.
The 12-Lesson Curriculum
From "what's a rack" to "design a 100MW AI campus" — learn at your own pace.
01 · Data Center Fundamentals
What a data center actually is, the four-layer Tier classification (Uptime Institute), the components inside a single rack, and why AI changed everything.
02 · Power Infrastructure
From the utility substation to the chip: high-voltage interconnects, UPS systems, generators, PDUs, and the 100MW+ scale that AI demands.
03 · Cooling Systems
Air, liquid, immersion. CRAC vs CDU, direct-to-chip liquid cooling for GPUs, PUE/WUE math, and why every modern AI rack is liquid-cooled.
04 · Compute & Accelerators
NVIDIA H100/H200/B200/GB200 NVL72, AMD MI300X, Google TPU v5p, AWS Trainium2, Cerebras WSE-3. Real specs, real interconnects.
05 · Network Fabric
InfiniBand vs Ethernet (Ultra Ethernet Consortium), NVLink/NVSwitch, optical transceivers, CLOS topology, and rail-optimized layouts.
06 · Storage Architecture
Parallel filesystems (Lustre, WekaFS, VAST), NVMe-oF, checkpoint strategies, and how 100k-GPU clusters move 100GB/s.
07 · Software & Orchestration
SLURM vs Kubernetes for AI, Run.AI, NVIDIA Base Command, gang scheduling, fault tolerance, and the orchestration stack on top of bare metal.
08 · How to Build One
Site selection, permitting, 18-36 month construction timelines, vendor selection, and the realistic capex of a 100MW AI campus.
09 · Operating a Data Center
DCIM, BMS, capacity planning, incident response, the day-to-day of running a critical facility at 99.99% uptime.
10 · Sustainability
PUE/WUE/CUE, hyperscaler net-zero pledges, geothermal partnerships, heat reuse for district heating, water positivity.
11 · Economics & Financing
$/MW capex, opex breakdown, neocloud business models (CoreWeave, Lambda, Crusoe), depreciation cycles, and why capex is exploding.
12 · Careers & How to Become an Expert
Roles, salaries, certifications (Uptime Institute CDCP/CDCS/CDCE, BICSI RCDD), training programs, and the career ladder.
Live AI Infra Intelligence
Refreshed every 12h by the Living Agent — analyzing fresh DC news for operator velocity, tech trends, and weekly shifts. Last updated 5 min ago.
📝 What changed this week
- Capital & Debt Markets Surge: Google and CoreWeave raised a record $5.7B in junk bonds, signaling intense investor appetite for funding AI data center build-out despite high costs and interest rates.
- Supply Chain & Build-Out Risks Intensify: Satellite analysis indicates 40% of planned 2026 AI data centers face delays, compounding pressure from the $250-300B annual spend required, now equated to 5-7 Manhattan Projects.
- Chip & Interconnect Partnerships Deepen: Nvidia invested $2B in Marvell to scale NVLink Fusion, while the IOWN Forum is advancing all-photonic WANs, both targeting critical bottlenecks in inter-GPU and inter-data center communication.
- Infrastructure Operators Expand & Monetize: Elice Group is scaling with modular data centers and planning an IPO, as ever-larger frontier models (the 0.5T-10T parameter class targeted by xAI's Colossus 2) drive demand, with 38% of Americans now living within 5 miles of an operational facility.
Curated by an autonomous agent reading live RSS + entity mentions. Rankings reflect actual coverage frequency, not editorial choice.
Latest Technical News
Filtered for substance: hardware specs, topology, MW, MFU. Press fluff demoted.
Top AI Data Center Operators
Hyperscalers, neoclouds, and colocation providers powering frontier AI.
| Operator | Type | Power (MW) | Notable Site | Specialty |
|---|---|---|---|---|
| Microsoft Azure | Hyperscaler | 5,000 | Mt. Pleasant, WI · Quincy, WA | OpenAI compute partner |
| Google Cloud | Hyperscaler | 4,500 | Council Bluffs, IA · The Dalles, OR | TPU pods + Gemini training |
| Amazon AWS | Hyperscaler | 4,000 | Project Rainier (Anthropic) · 2.2GW | Trainium2/3 + GPU clusters |
| Meta | Hyperscaler | 3,500 | Hyperion, Richland Parish LA · 2GW | Llama training, custom MTIA |
| xAI | Hyperscaler | 750 | Colossus, Memphis TN · 200MW phase 1 | Grok training, 100k+ H100 |
| CoreWeave | Neocloud | 1,300 | Plano, TX · multiple sites | GPU-as-a-service, NVIDIA partner |
| Equinix | Colocation | 1,500 | Global · 260+ data centers | Interconnection + colo |
| Digital Realty | Colocation | 2,700 | Global · 300+ data centers | Wholesale + hyperscale colo |
| Lambda | Neocloud | 200 | Allen, TX · expanding | GPU cloud, on-demand H100/H200 |
| Crusoe Energy | Neocloud | 1,200 | Abilene, TX (Stargate Phase 1) | Stranded-gas powered AI infra |
MW figures are publicly disclosed AI-dedicated capacity, current or planned. Updated continuously from press releases, permit filings, and infrastructure analysis.
Stop reading. Start designing.
The Data Center Designer simulator. Pick scale (1 MW → 2.5 GW), GPU (H100 / B200 / GB200 NVL72 / MI300X / TPU / Trainium), cooling, location, tier — see the real capex, opex, PUE, training throughput, build timeline, and CO₂ math your design produces. Six presets matched to real projects (Stargate, Hyperion, Project Rainier).
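To give a feel for the arithmetic the simulator formalizes, here is a minimal sizing sketch. It is not the simulator's actual model: the ~120 kW-per-rack and 72-GPUs-per-rack figures come from the GB200 NVL72 entry in the glossary below, and the 1.10 PUE is an illustrative assumption.

```python
# Back-of-envelope sizing sketch -- illustrative assumptions, not the simulator's model.

RACK_KW = 120        # ~120 kW per GB200 NVL72 rack (see glossary below)
GPUS_PER_RACK = 72   # Blackwell GPUs per NVL72 rack
PUE = 1.10           # assumed hyperscale-class PUE

def size_campus(it_budget_mw: float) -> dict:
    """Rough rack count, GPU count, and facility draw for a given IT power budget."""
    racks = int(it_budget_mw * 1000 // RACK_KW)
    it_load_mw = racks * RACK_KW / 1000
    return {
        "racks": racks,
        "gpus": racks * GPUS_PER_RACK,
        "it_load_mw": it_load_mw,
        "facility_mw": round(it_load_mw * PUE, 1),  # total draw = IT load x PUE
    }

print(size_campus(100))
# {'racks': 833, 'gpus': 59976, 'it_load_mw': 99.96, 'facility_mw': 110.0}
```

The real design space (cooling choice, tier, location, GPU generation) moves every one of these numbers; that is what the simulator is for.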
Speak the language
From PUE to NVLink — the vocabulary you need to read any data center paper.
PUE (Power Usage Effectiveness)
Total facility power ÷ IT equipment power. 1.0 = perfect, 1.10 = hyperscale, 1.5+ = enterprise.
Direct-to-Chip Liquid Cooling
Coolant flows through cold plates touching the chip. Required for Blackwell-class racks above 70kW.
NVLink
5th gen on Blackwell: 1.8 TB/s bidirectional per GPU. Connects GPUs into a single memory domain.
InfiniBand
NDR = 400 Gbps, XDR = 800 Gbps per port. Dominates scale-out networks for AI training.
HBM (High Bandwidth Memory)
Stacked DRAM next to the GPU die. H100 = HBM3 (3.35 TB/s), B200 = HBM3e (8 TB/s).
MFU (Model FLOPs Utilization)
% of theoretical peak FLOPs your training run actually achieves. 50%+ is great. See the worked sketch after this glossary.
CDU (Coolant Distribution Unit)
Heat exchanger between the rack's liquid loop and the facility loop. Sits at row or rack level.
Tier IV
Fault-tolerant: every component is redundant + concurrently maintainable. 99.995% uptime target.
GB200 NVL72
72 Blackwell GPUs + 36 Grace CPUs in one liquid-cooled rack. ~120 kW. 1.4 EFLOPS FP4.
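A minimal worked sketch of the two ratios defined above, PUE and MFU. The 10 MW / 11 MW facility figures and the 500 TFLOP/s against an assumed ~1 PFLOP/s peak are illustrative assumptions, not measurements from any real site or training run.

```python
# Worked PUE / MFU examples -- illustrative numbers only.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_load_kw

def mfu(achieved_flops: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: fraction of theoretical peak FLOPs actually achieved."""
    return achieved_flops / peak_flops

# A hypothetical 10 MW IT load drawing 11 MW at the meter:
print(f"PUE = {pue(11_000, 10_000):.2f}")    # 1.10 -> hyperscale territory

# A run sustaining 500 TFLOP/s per GPU against an assumed ~1 PFLOP/s theoretical peak:
print(f"MFU = {mfu(500e12, 1e15):.0%}")      # 50% -> 'great' by the rule of thumb above
```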
Become an expert
Real-world courses and certifications used by the people building the largest AI clusters on Earth.
Certified Data Centre Professional (CDCP)
Industry-standard intro covering site selection, racks, power, cooling, fire suppression. The typical first credential for new DC operators.
Accredited Tier Designer (ATD)
Official Uptime certification for designing Tier I-IV facilities. The credential hyperscalers actually look for.
NVIDIA Deep Learning Institute
Hands-on courses on GPU clusters, multi-node training, NVIDIA NIM, CUDA fundamentals. Many free.
Data Center Engineering Specialist (CDES)
Cabling-focused but covers full DC engineering. Required for many mission-critical roles.
Schneider Electric Data Center University
Free vendor-neutral courses on power, cooling, racks, design. Great starting point — no marketing pitch.
Open Compute Project (OCP) Specs & Tracks
Open hardware specifications used by Meta, Microsoft, Google. Read what hyperscalers actually deploy.
Get the data center briefing
Weekly: only the technical news that matters. New papers, MW build-outs, topology decisions, hardware drops.