FAQ · 30 Questions
Straight answers.
The questions people actually ask about AI data centers, grouped by topic. Each answer is 2-4 sentences max. Deeper lessons linked where it helps.
Basics
What actually is a data center?+
A data center is a building engineered to keep computers running 24/7. Four physical systems: power (from the grid + backups), cooling (to remove heat), networking (fiber connectivity), and the IT itself (servers, storage, switches). Everything else is accessory.
How is an AI data center different from a regular one?+
Power density is roughly 10× higher. A traditional rack draws 5–10 kW; an AI rack with GPUs draws 70–130 kW. This single fact forces every other design decision: liquid cooling instead of air, InfiniBand instead of Ethernet, dedicated substations instead of grid taps, and campuses sized in gigawatts instead of megawatts.
What does 'Tier III' actually mean?+
A reliability rating from Uptime Institute. Tier III means 'concurrently maintainable' — you can take any one component offline for repair without losing the facility. Target uptime: 99.982% (≈1.6 hours of downtime per year). Most AI facilities design to Tier III; hyperscalers often skip certification and build to their own equivalent specs.
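The downtime figure follows directly from the availability percentage; a quick sketch (8,760 hours per year assumed, ignoring leap years):

```python
# Allowed downtime per year implied by an availability target.
HOURS_PER_YEAR = 8760

def downtime_hours(availability: float) -> float:
    """Hours of downtime per year at the given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR

tier_iii = downtime_hours(0.99982)  # ~1.6 hours/year
```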
What's a 'rack unit' or 'U'?+
1U = 1.75 inches (44.45 mm) of vertical space. A standard cabinet is 42U or 48U tall. A 1U server is one slot; an 8U GPU server takes 8 slots.
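The conversion, as a tiny sketch:

```python
# 1U = 1.75 in = 44.45 mm of vertical rack space.
U_MM = 44.45

def cabinet_height_m(units: int) -> float:
    """Usable vertical space of a cabinet in metres."""
    return units * U_MM / 1000

# A 42U cabinet offers ~1.87 m of slots; a 48U cabinet ~2.13 m.
```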
What's the difference between scale-up and scale-out?+
Scale-up = making a bunch of GPUs act like one big GPU (NVLink within a rack, ~1.8 TB/s per GPU). Scale-out = connecting thousands of those units together across the building (InfiniBand at 400–800 Gbps per port). Both are needed to train frontier models.
Power
Why are AI campuses measured in gigawatts now?+
One modern rack (GB200 NVL72) draws ~120 kW. A 100 MW campus = ~800 of those racks. A 1 GW campus = ~8,000. Frontier training runs need 30,000–500,000+ GPUs on a single network fabric, which drives power into the gigawatt range.
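The rack arithmetic above, as a sketch (it ignores PUE overhead, so a real campus fits slightly fewer racks than this):

```python
# Rack count for a given campus power, ignoring PUE overhead.
RACK_KW = 120          # GB200 NVL72, per the answer above
GPUS_PER_RACK = 72

def racks_for(campus_mw: float) -> int:
    """Whole racks a campus of campus_mw can power at RACK_KW each."""
    return int(campus_mw * 1000 // RACK_KW)

# 100 MW -> 833 racks (~60,000 GPUs); 1 GW -> 8,333 racks (~600,000 GPUs).
```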
Why is grid power the bottleneck?+
Utilities don't build new transmission + generation quickly. Adding 500+ MW of new industrial load requires new substations, transmission lines, and often new power plants. In tier-1 markets (Northern Virginia, Dublin, Singapore) the queue is 5+ years. That's why xAI built gas turbines in Memphis and OpenAI chose Texas.
Why are hyperscalers buying nuclear power?+
AI needs 24/7 firm clean power. Solar stops at sunset. Wind is intermittent. Gas emits carbon. Nuclear is the only source that is firm AND low-carbon AND scalable. Microsoft-Three Mile Island, AWS-Talen Susquehanna, Google-Kairos Power (SMR) — all signed in 2024.
Why did xAI use gas turbines without permits?+
Shelby County, TN only requires air permits for generators in place more than 364 days. xAI declared them 'portable' and ran them unpermitted for ~18 months. The NAACP + SELC filed Clean Air Act notices; xAI removed some of the turbines and got permits for the rest. The EPA closed the 'portable' loophole in January 2026.
What's a UPS? Why is it different from a battery backup?+
UPS = Uninterruptible Power Supply. It's the bridge between the grid cutting out and the generators starting. UPS responds in <10 milliseconds; generators take 10+ seconds to come online. Without the UPS bridge, everything reboots every time the grid flickers.
What does PUE mean?+
Power Usage Effectiveness. Total facility power divided by IT power. 1.0 = perfect (all energy goes to computers). 1.10 = hyperscale target (~10% overhead on cooling+lights). 1.5+ = legacy enterprise. A 100 MW facility at PUE 1.10 delivers ≈91 MW to the IT, with ≈9 MW going to cooling, lights, and electrical losses.
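Strictly, PUE = total / IT, so the IT share is the facility total divided by PUE; a sketch:

```python
def it_power_mw(total_mw: float, pue: float) -> float:
    """IT power implied by a facility total and its PUE (PUE = total / IT)."""
    return total_mw / pue

it = it_power_mw(100, 1.10)   # ~90.9 MW to the computers
overhead = 100 - it           # ~9.1 MW to cooling, lights, losses
```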
Cooling
Why is liquid cooling mandatory for modern AI?+
Physics. Air carries ~1.2 kJ of heat per cubic meter per degree; water carries ~4,180 kJ — roughly 3,500× more. Above ~30–40 kW per rack, airflow can't keep up. B200 chips run at a 1,000 W TDP; only direct liquid contact via cold plates can move that heat away fast enough.
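The flow-rate consequence, sketched with an assumed 10 K coolant temperature rise (the 10 K and the 120 kW rack are example figures):

```python
# Coolant flow needed to carry away a rack's heat at a given temperature rise.
C_AIR_KJ_M3K = 1.2       # volumetric heat capacity of air (kJ per m^3 per K)
C_WATER_KJ_M3K = 4180.0  # volumetric heat capacity of water

def flow_m3_per_s(heat_kw: float, c_vol: float, delta_t_k: float) -> float:
    """m^3/s of coolant needed to remove heat_kw at a delta_t_k rise."""
    return heat_kw / (c_vol * delta_t_k)

air = flow_m3_per_s(120, C_AIR_KJ_M3K, 10)      # ~10 m^3/s of air
water = flow_m3_per_s(120, C_WATER_KJ_M3K, 10)  # ~0.0029 m^3/s (~2.9 L/s)
```

Ten cubic meters of air per second through one cabinet is not practical; three liters of water per second is.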
What is immersion cooling? Is it really used?+
Immersion = the servers are submerged in a dielectric (non-conductive) fluid. Best thermal performance, lowest PUE, but operationally complex — you can't just yank a drive out. Used in niche deployments. Two-phase immersion (boiling fluorocarbons) had momentum but was hit by 3M's PFAS phase-out announcement.
What's WUE? Does water usage matter?+
Water Usage Effectiveness — liters of water consumed per kWh of IT energy. Evaporative cooling ~1.8 L/kWh; closed-loop dry cooling ~0. In water-stressed regions (Phoenix, Madrid) operators switch to dry cooling and accept worse PUE. Microsoft pledged 'water positive by 2030' — replenishing more than they consume.
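The water math for an evaporatively cooled site, sketched with the ~1.8 L/kWh figure above and an assumed constant 100 MW IT load:

```python
def daily_water_liters(it_mw: float, wue_l_per_kwh: float) -> float:
    """Liters of cooling water consumed per day at a constant IT load."""
    kwh_per_day = it_mw * 1000 * 24
    return kwh_per_day * wue_l_per_kwh

evap = daily_water_liters(100, 1.8)  # 4.32 million liters/day
dry = daily_water_liters(100, 0.0)   # ~0 for closed-loop dry cooling
```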
Compute
Do I need NVIDIA to train frontier AI?+
Technically no. Anthropic's Project Rainier runs on ~500,000 AWS Trainium2 chips. Google trains Gemini on TPUs. AMD MI300X is hardware-competitive. But in 2026 NVIDIA still has the most mature software stack (CUDA + NCCL + NeMo + TensorRT) so switching costs years of engineering.
What is a GB200 NVL72?+
NVIDIA's Blackwell rack-scale system: 72 B200 GPUs + 36 Grace CPUs in one liquid-cooled rack, connected as a single NVLink domain via 9 NVSwitch trays. Total: ~14 TB HBM3e, 1.4 EFLOPS FP4, ~120 kW. Effectively one very big GPU.
What's MFU?+
Model FLOPs Utilization — the % of theoretical peak FLOPs your training job actually achieves. 30-55% is normal. The gap is eaten by communication overhead, memory stalls, restarts, and sub-optimal kernels. Peak FLOPs are marketing; MFU is reality.
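MFU is one division; the 500 PFLOP/s job below is a hypothetical example, not a measured run:

```python
def mfu(achieved_flops: float, peak_flops: float) -> float:
    """Model FLOPs Utilization: achieved throughput / theoretical peak."""
    return achieved_flops / peak_flops

# Hypothetical: sustaining 500 PFLOP/s on 1.4 EFLOP/s of peak is ~36% MFU,
# squarely inside the normal 30-55% band.
job = mfu(500e15, 1.4e18)
```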
What does FP8 or FP4 mean?+
Number formats used in AI math. Lower precision = more throughput, less memory, slightly less accuracy. FP32 is classic. BF16 is standard training. FP8 is used by Blackwell for modern training runs. FP4 is used for inference. A single NVL72 rack does 1.4 EFLOPS at FP4 but only 360 PFLOPS at BF16.
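The throughput gap between formats, using the two NVL72 figures quoted above:

```python
# FP4 vs BF16 throughput on one GB200 NVL72 rack.
FP4_FLOPS = 1.4e18    # 1.4 EFLOPS
BF16_FLOPS = 360e15   # 360 PFLOPS

speedup = FP4_FLOPS / BF16_FLOPS  # ~3.9x from dropping precision
```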
Networking
Why InfiniBand and not Ethernet?+
InfiniBand was designed for HPC — lossless flow control, hardware RDMA, sub-microsecond latency. Ethernet was designed for internet traffic. For gradient synchronization across 10,000+ GPUs, InfiniBand's latency and packet-loss behavior matters. Ultra Ethernet (2024 consortium) is the open alternative catching up.
What's NVLink?+
NVIDIA's proprietary GPU-to-GPU interconnect. NVLink 5 on Blackwell = 1.8 TB/s bidirectional per GPU. Within a rack (NVL72), 72 GPUs share one NVLink domain. Cross-rack, you use InfiniBand. NVLink is fast but short; InfiniBand is slower but scales.
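A rough per-link comparison, NVLink per GPU vs one InfiniBand port (the 800 Gbps port figure is from the answer above; real fabrics use several ports per GPU):

```python
# Per-GPU NVLink bandwidth vs one InfiniBand port, in GB/s.
NVLINK5_GB_S = 1800      # 1.8 TB/s bidirectional per GPU
IB_PORT_GBIT_S = 800     # per-port, in gigabits

ib_gb_s = IB_PORT_GBIT_S / 8     # bits -> bytes: 100 GB/s
ratio = NVLINK5_GB_S / ib_gb_s   # ~18x more bandwidth inside the rack
```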
Economics
How much does a 100 MW AI data center cost?+
Turnkey with modern GPUs: $2.5-5 billion. Shell-only (power + cooling + networking + building): $800M-1.2B. IT equipment (the 30,000 GPUs): $900M+. GPUs alone = roughly half of total capex.
Why does Blue Owl own the Hyperion campus instead of Meta?+
Capital structure. Meta would rather not carry $27B of capex directly on its balance sheet. Private equity (Blue Owl) provides the capital. Meta becomes a tenant with an exit option every four years. This structure is now standard for 2+ GW builds — Stargate uses a similar four-party JV.
What is a 'neocloud'?+
A GPU-as-a-service company built around AI compute. CoreWeave, Lambda, Crusoe, Nscale. They buy GPUs (often debt-financed), rent them out by the hour, and compete with hyperscaler clouds on price and availability. CoreWeave IPO'd in March 2025 at a $23B valuation.
Sustainability
Are AI data centers an environmental disaster?+
Complicated. Grid intensity varies wildly: 50 g CO₂/kWh in France, 5 g in Iceland, ~400 g in Texas. A 1 GW campus on a coal grid is a disaster; on a hydro grid, it's not. Current AI buildout is happening mostly on US grids with significant gas + coal, so emissions are rising even as renewable share grows. Hyperscalers have 2030 net-zero pledges — their 2024 reports show emissions UP vs 2020.
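The grid-intensity point, as back-of-envelope arithmetic (constant full load assumed, which overstates real consumption):

```python
def annual_tonnes_co2(it_mw: float, grid_g_per_kwh: float) -> float:
    """Tonnes of CO2 per year for a constant load on a given grid."""
    kwh = it_mw * 1000 * 8760
    return kwh * grid_g_per_kwh / 1e6  # grams -> tonnes

texas = annual_tonnes_co2(1000, 400)  # ~3.5 Mt CO2/yr for a 1 GW campus
france = annual_tonnes_co2(1000, 50)  # ~0.44 Mt on a mostly-nuclear grid
```

Same computers, an ~8× emissions difference from siting alone.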
Why does Louisiana need 10 new gas plants for one data center?+
Meta's Hyperion will ultimately consume ~5 GW — more than half of Entergy Louisiana's total current generation. The state couldn't provide that from existing plants. 10 gas plants + 240 miles of transmission + battery storage are the fastest-to-build option. An MOU on future nuclear is part of the long-term plan.
Careers
How do I get a job in data centers?+
Three paths. (1) Technician: start with EPI CDCP certification ($1,400) → apply at Equinix/Digital Realty/hyperscalers. (2) Engineer: EE or ME degree → MEP consulting firm (JBA, Bala, Syska) → Uptime ATD cert. (3) AI cluster engineer: software/SRE background → NVIDIA NCA-AIIO cert → target Anthropic/OpenAI/CoreWeave.
What do data center jobs pay?+
US figures (median). Technician: $65-95k. Critical Facilities Technician: $75-120k. Data Center Engineer: $95-160k. Facilities Manager: $140-220k. AI Cluster Engineer at frontier labs: $200-400k+ total comp. France: generally 30-40% lower in base salary, compensated partly by 13th month, bonuses, and stronger benefits.
Context
What's the Stargate Project?+
A Delaware joint venture of OpenAI, Oracle, SoftBank, and MGX committing up to $500B by 2029 for US AI infrastructure. Announced by Trump on Jan 21, 2025. Flagship site is a 1.2 GW campus in Abilene, Texas built by Crusoe on the Lancium Clean Campus. Five more sites now announced; total target 10 GW.
Why was Stargate Abilene 'capped' at 1.2 GW?+
Oracle and OpenAI had been negotiating to expand to 2 GW. Additional grid capacity wouldn't be available until 2027+. Rather than wait, both walked. Microsoft leased the adjacent 900 MW. OpenAI pivoted new Stargate capacity to NVIDIA's next-generation 'Vera Rubin' silicon on five other sites.
Is there a 'data center bubble'?+
Disputed. Bull case: AI demand compounds → every GPU finds a buyer → capex creates real assets. Bear case: 2024-2027 hyperscaler capex will exceed $1T; the closest analog, the 2000 telecom buildout, preceded ~90% peak-to-trough drops in carrier valuations. The honest answer: watch GPU utilization + training workload growth vs capex pace through 2027.
Question not here?
Browse the 205-term glossary for single-word definitions, the 12-lesson curriculum for deep dives, or the 4 case studies for the real named campuses.