FAQ · 30 Questions
Straight answers.
The questions people actually ask about AI data centers, grouped by topic. Each answer is 2-4 sentences max. Deeper lessons linked where it helps.
Basics
What actually is a data center?+
A data center is a building engineered to keep computers running 24/7. Four physical systems: power (from the grid + backups), cooling (to remove heat), networking (fiber connectivity), and the IT itself (servers, storage, switches). Everything else is accessory.
How is an AI data center different from a regular one?+
Power density is roughly 10× higher. A traditional rack draws 5–10 kW; an AI rack with GPUs draws 70–130 kW. This single fact forces every other design decision: liquid cooling instead of air, InfiniBand instead of Ethernet, dedicated substations instead of grid taps, and campuses sized in gigawatts instead of megawatts.
What does 'Tier III' actually mean?+
A reliability rating from Uptime Institute. Tier III means 'concurrently maintainable' — you can take any one component offline for repair without losing the facility. Target uptime: 99.982% (≈1.6 hours of downtime per year). Most AI facilities design to Tier III; hyperscalers often skip certification and build to their own equivalent specs.
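The downtime figure follows directly from the availability percentage; a quick sketch (8,760 hours per year assumed, ignoring leap years):

```python
# Allowed downtime per year implied by an availability target.
HOURS_PER_YEAR = 8760

def downtime_hours(availability: float) -> float:
    """Hours of downtime per year at the given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR

tier_iii = downtime_hours(0.99982)  # ~1.6 hours/year
```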
What's a 'rack unit' or 'U'?+
1U = 1.75 inches (44.45 mm) of vertical space. A standard cabinet is 42U or 48U tall. A 1U server is one slot; an 8U GPU server takes 8 slots.
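The conversion, as a tiny sketch:

```python
# 1U = 1.75 in = 44.45 mm of vertical rack space.
U_MM = 44.45

def cabinet_height_m(units: int) -> float:
    """Usable vertical space of a cabinet in metres."""
    return units * U_MM / 1000

# A 42U cabinet offers ~1.87 m of slots; a 48U cabinet ~2.13 m.
```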
What's the difference between scale-up and scale-out?+
Scale-up = making a bunch of GPUs act like one big GPU (NVLink within a rack, ~1.8 TB/s per GPU). Scale-out = connecting thousands of those units together across the building (InfiniBand at 400–800 Gbps per port). Both are needed to train frontier models.
Power
Why are AI campuses measured in gigawatts now?+
One modern rack (GB200 NVL72) draws ~120 kW. A 100 MW campus = ~800 of those racks. A 1 GW campus = ~8,000. Frontier training runs need 30,000–500,000+ GPUs on a single network fabric, which drives power into the gigawatt range.
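The rack arithmetic above, as a sketch (it ignores PUE overhead, so a real campus fits slightly fewer racks than this):

```python
# Rack count for a given campus power, ignoring PUE overhead.
RACK_KW = 120          # GB200 NVL72, per the answer above
GPUS_PER_RACK = 72

def racks_for(campus_mw: float) -> int:
    """Whole racks a campus of campus_mw can power at RACK_KW each."""
    return int(campus_mw * 1000 // RACK_KW)

# 100 MW -> 833 racks (~60,000 GPUs); 1 GW -> 8,333 racks (~600,000 GPUs).
```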
Why is grid power the bottleneck?+
Utilities don't build new transmission + generation quickly. Adding 500+ MW of new industrial load requires new substations, transmission lines, and often new power plants. In tier-1 markets (Northern Virginia, Dublin, Singapore) the queue is 5+ years. That's why xAI built gas turbines in Memphis and OpenAI chose Texas.
Why are hyperscalers buying nuclear power?+
AI needs 24/7 firm clean power. Solar stops at sunset. Wind is intermittent. Gas emits carbon. Nuclear is the only source that is firm AND low-carbon AND scalable. Microsoft-Three Mile Island, AWS-Talen Susquehanna, Google-Kairos Power (SMR) — all signed in 2024.
Why did xAI use gas turbines without permits?+
Shelby County, TN only requires air permits for generators in place more than 364 days. xAI declared them 'portable' and ran them unpermitted for ~18 months. The NAACP + SELC filed Clean Air Act notices; xAI removed some of the turbines and got permits for the rest. The EPA closed the 'portable' loophole in January 2026.
What's a UPS? Why is it different from a battery backup?+
UPS = Uninterruptible Power Supply. It's the bridge between the grid cutting out and the generators starting. UPS responds in <10 milliseconds; generators take 10+ seconds to come online. Without the UPS bridge, everything reboots every time the grid flickers.
What does PUE mean?+
Power Usage Effectiveness. Total facility power divided by IT power. 1.0 = perfect (all energy goes to computers). 1.10 = hyperscale target (~10% overhead on cooling+lights). 1.5+ = legacy enterprise. A 100 MW facility at PUE 1.10 delivers ≈91 MW to the IT, with ≈9 MW going to cooling, lights, and electrical losses.
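Strictly, PUE = total / IT, so the IT share is the facility total divided by PUE; a sketch:

```python
def it_power_mw(total_mw: float, pue: float) -> float:
    """IT power implied by a facility total and its PUE (PUE = total / IT)."""
    return total_mw / pue

it = it_power_mw(100, 1.10)   # ~90.9 MW to the computers
overhead = 100 - it           # ~9.1 MW to cooling, lights, losses
```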
Cooling
Why is liquid cooling mandatory for modern AI?+
Physics. Air carries ~1.2 kJ of heat per cubic meter per degree; water carries ~4,180 kJ — roughly 3,500× more. Above ~30–40 kW per rack, airflow can't keep up. B200 chips run at a 1,000 W TDP; only direct liquid contact via cold plates can move that heat away fast enough.
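The flow-rate consequence, sketched with an assumed 10 K coolant temperature rise (the 10 K and the 120 kW rack are example figures):

```python
# Coolant flow needed to carry away a rack's heat at a given temperature rise.
C_AIR_KJ_M3K = 1.2       # volumetric heat capacity of air (kJ per m^3 per K)
C_WATER_KJ_M3K = 4180.0  # volumetric heat capacity of water

def flow_m3_per_s(heat_kw: float, c_vol: float, delta_t_k: float) -> float:
    """m^3/s of coolant needed to remove heat_kw at a delta_t_k rise."""
    return heat_kw / (c_vol * delta_t_k)

air = flow_m3_per_s(120, C_AIR_KJ_M3K, 10)      # ~10 m^3/s of air
water = flow_m3_per_s(120, C_WATER_KJ_M3K, 10)  # ~0.0029 m^3/s (~2.9 L/s)
```

Ten cubic meters of air per second through one cabinet is not practical; three liters of water per second is.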
What is immersion cooling? Is it really used?+
Immersion = the servers are submerged in a dielectric (non-conductive) fluid. Best thermal performance, lowest PUE, but operationally complex — you can't just yank a drive out. Used in niche deployments. Two-phase immersion (boiling fluorocarbons) had momentum but was hit by 3M's PFAS phase-out announcement.
What's WUE? Does water usage matter?+
Water Usage Effectiveness — liters of water consumed per kWh of IT energy. Evaporative cooling ~1.8 L/kWh; closed-loop dry cooling ~0. In water-stressed regions (Phoenix, Madrid) operators switch to dry cooling and accept worse PUE. Microsoft pledged 'water positive by 2030' — replenishing more than they consume.
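The water math for an evaporatively cooled site, sketched with the ~1.8 L/kWh figure above and an assumed constant 100 MW IT load:

```python
def daily_water_liters(it_mw: float, wue_l_per_kwh: float) -> float:
    """Liters of cooling water consumed per day at a constant IT load."""
    kwh_per_day = it_mw * 1000 * 24
    return kwh_per_day * wue_l_per_kwh

evap = daily_water_liters(100, 1.8)  # 4.32 million liters/day
dry = daily_water_liters(100, 0.0)   # ~0 for closed-loop dry cooling
```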
Compute
Do I need NVIDIA to train frontier AI?+
Technically no. Anthropic's Project Rainier runs on ~500,000 AWS Trainium2 chips. Google trains Gemini on TPUs. AMD MI300X is hardware-competitive. But in 2026 NVIDIA still has the most mature software stack (CUDA + NCCL + NeMo + TensorRT) so switching costs years of engineering.
What is a GB200 NVL72?+
NVIDIA's Blackwell rack-scale system: 72 B200 GPUs + 36 Grace CPUs in one liquid-cooled rack, connected as a single NVLink domain via 9 NVSwitch trays. Total: ~14 TB HBM3e, 1.4 EFLOPS FP4, ~120 kW. Effectively one very big GPU.
What's MFU?+
Model FLOPs Utilization — the % of theoretical peak FLOPs your training job actually achieves. 30-55% is normal. The gap is eaten by communication overhead, memory stalls, restarts, and sub-optimal kernels. Peak FLOPs are marketing; MFU is reality.
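MFU is one division; the 500 PFLOP/s job below is a hypothetical example, not a measured run:

```python
def mfu(achieved_flops: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved throughput / theoretical peak."""
    return achieved_flops / peak_flops

# Hypothetical: sustaining 500 PFLOP/s on 1.4 EFLOP/s of peak is ~36% MFU,
# squarely inside the normal 30-55% band.
job = mfu(500e15, 1.4e18)
```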
What does FP8 or FP4 mean?+
Number formats used in AI math. Lower precision = more throughput, less memory, slightly less accuracy. FP32 is classic. BF16 is standard training. FP8 is used by Blackwell for modern training runs. FP4 is used for inference. A single NVL72 rack does 1.4 EFLOPS at FP4 but only 360 PFLOPS at BF16.
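The throughput gap between formats, using the two NVL72 figures quoted above:

```python
# FP4 vs BF16 throughput on one GB200 NVL72 rack.
FP4_FLOPS = 1.4e18    # 1.4 EFLOPS
BF16_FLOPS = 360e15   # 360 PFLOPS

speedup = FP4_FLOPS / BF16_FLOPS  # ~3.9x from dropping precision
```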
Networking
Why InfiniBand and not Ethernet?+
InfiniBand was designed for HPC — lossless flow control, hardware RDMA, sub-microsecond latency. Ethernet was designed for internet traffic. For gradient synchronization across 10,000+ GPUs, InfiniBand's latency and packet-loss behavior matters. Ultra Ethernet (2024 consortium) is the open alternative catching up.
What's NVLink?+
NVIDIA's proprietary GPU-to-GPU interconnect. NVLink 5 on Blackwell = 1.8 TB/s bidirectional per GPU. Within a rack (NVL72), 72 GPUs share one NVLink domain. Cross-rack, you use InfiniBand. NVLink is fast but short; InfiniBand is slower but scales.
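A rough per-link comparison, NVLink per GPU vs one InfiniBand port (the 800 Gbps port figure is from the answer above; real fabrics use several ports per GPU):

```python
# Per-GPU NVLink bandwidth vs one InfiniBand port, in GB/s.
NVLINK5_GB_S = 1800      # 1.8 TB/s bidirectional per GPU
IB_PORT_GBIT_S = 800     # per-port, in gigabits

ib_gb_s = IB_PORT_GBIT_S / 8     # bits -> bytes: 100 GB/s
ratio = NVLINK5_GB_S / ib_gb_s   # ~18x more bandwidth inside the rack
```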
Economics
How much does a 100 MW AI data center cost?+
Turnkey with modern GPUs: $2.5-5 billion. Shell-only (power + cooling + networking + building): $800M-1.2B. IT equipment (the 30,000 GPUs): $900M+. GPUs alone = roughly half of total capex.
Why does Blue Owl own the Hyperion campus instead of Meta?+
Capital structure. Meta would rather not carry $27B of capex directly on its balance sheet. Private equity (Blue Owl) provides the capital. Meta becomes a tenant with an exit option every four years. This structure is now standard for 2+ GW builds — Stargate uses a similar four-party JV.
What is a 'neocloud'?+
A GPU-as-a-service company built around AI compute. CoreWeave, Lambda, Crusoe, Nscale. They buy GPUs (often debt-financed), rent them out by the hour, and compete with hyperscaler clouds on price and availability. CoreWeave IPO'd in March 2025 at a $23B valuation.
Sustainability
Are AI data centers an environmental disaster?+
Complicated. Grid intensity varies wildly: 50 g CO₂/kWh in France, 5 g in Iceland, ~400 g in Texas. A 1 GW campus on a coal grid is a disaster; on a hydro grid, it's not. Current AI buildout is happening mostly on US grids with significant gas + coal, so emissions are rising even as renewable share grows. Hyperscalers have 2030 net-zero pledges — their 2024 reports show emissions UP vs 2020.
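The grid-intensity point, as back-of-envelope arithmetic (constant full load assumed, which overstates real consumption):

```python
def annual_tonnes_co2(it_mw: float, grid_g_per_kwh: float) -> float:
    """Tonnes of CO2 per year for a constant load on a given grid."""
    kwh = it_mw * 1000 * 8760
    return kwh * grid_g_per_kwh / 1e6  # grams -> tonnes

texas = annual_tonnes_co2(1000, 400)  # ~3.5 Mt CO2/yr for a 1 GW campus
france = annual_tonnes_co2(1000, 50)  # ~0.44 Mt on a mostly-nuclear grid
```

Same computers, an ~8× emissions difference from siting alone.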
Why does Louisiana need 10 new gas plants for one data center?+
Meta's Hyperion will ultimately consume ~5 GW — more than half of Entergy Louisiana's total current generation. The state couldn't provide that from existing plants. 10 gas plants + 240 miles of transmission + battery storage are the fastest-to-build option. An MOU on future nuclear is part of the long-term plan.
Careers
How do I get a job in data centers?+
Three paths. (1) Technician: start with EPI CDCP certification ($1,400) → apply at Equinix/Digital Realty/hyperscalers. (2) Engineer: EE or ME degree → MEP consulting firm (JBA, Bala, Syska) → Uptime ATD cert. (3) AI cluster engineer: software/SRE background → NVIDIA NCA-AIIO cert → target Anthropic/OpenAI/CoreWeave.
What do data center jobs pay?+
US figures (median). Technician: $65-95k. Critical Facilities Technician: $75-120k. Data Center Engineer: $95-160k. Facilities Manager: $140-220k. AI Cluster Engineer at frontier labs: $200-400k+ total comp. France: generally 30-40% lower in base salary, compensated partly by 13th month, bonuses, and stronger benefits.
Context
What's the Stargate Project?+
A Delaware joint venture of OpenAI, Oracle, SoftBank, and MGX committing up to $500B by 2029 for US AI infrastructure. Announced by Trump on Jan 21, 2025. Flagship site is a 1.2 GW campus in Abilene, Texas built by Crusoe on the Lancium Clean Campus. Five more sites now announced; total target 10 GW.
Why was Stargate Abilene 'capped' at 1.2 GW?+
Oracle and OpenAI had been negotiating to expand to 2 GW. Additional grid capacity wouldn't be available until 2027+. Rather than wait, both walked. Microsoft leased the adjacent 900 MW. OpenAI pivoted new Stargate capacity to NVIDIA's next-generation 'Vera Rubin' silicon on five other sites.
Is there a 'data center bubble'?+
Disputed. Bull case: AI demand compounds → every GPU finds a buyer → capex creates real assets. Bear case: 2024-2027 hyperscaler capex will exceed $1T; the closest analog, the 2000 telecom buildout, preceded ~90% peak-to-trough drops in carrier valuations. The honest answer: watch GPU utilization + training workload growth vs capex pace through 2027.
Question not here?
Browse the 205-term glossary for single-word definitions, the 12-lesson curriculum for deep dives, or the 4 case studies for the real named campuses.