Glossary
205 terms across power, cooling, networking, compute, storage, software, facility, sustainability, economics, and standards. Searchable. Cross-linked to the curriculum.
Cost per megawatt of capacity. $8-12M for shell only; $30-50M with modern GPUs.
Stronger than annual matching: every hour of consumption is matched with carbon-free energy on the local grid. Google's 2030 target.
Switches dynamically reroute around congested paths. InfiniBand has it natively; Ultra Ethernet adds it.
Pre-cool intake air by evaporating water before it enters dry coolers. Boosts free-cooling hours.
Buy enough renewable certificates to match annual consumption. Weaker than 24/7 matching — the gap when solar drops off at sunset is still covered by gas peakers.
TIA's data center standard, with its own reliability ratings (Rated-1 through Rated-4), distinct from Uptime Institute Tiers.
Fiber pre-terminated with active electronics on each end. Up to ~30m.
Difference between coolant return temp and outside wet-bulb. A lower approach requires a larger cooling tower.
Industry envelope for allowable inlet temperatures. W4 = supply water up to 45°C for liquid cooling.
Mission Critical Facilities subcommittee. Defines thermal envelopes (W1-W5) for liquid cooling.
Chip designed for one purpose. AI ASICs (TPU, Trainium, MTIA, Maia) compete with general-purpose GPUs.
Switches load between utility and generator power automatically when one source fails.
Switches load between utility and generator power automatically when utility fails.
NVIDIA's cluster management software (formerly Bright Cluster Manager). Provisioning + monitoring + workload.
Lithium-ion battery module. Modern UPS racks use BBUs in N+1 configuration.
Generation owned by the data center, not the utility. Bypasses grid interconnect queues.
Total bandwidth across the worst-case partition of the network. Key metric for fabric quality.
NVIDIA's DPU family. BlueField-3 = ARM cores + 400G ports. Powers offload in modern AI clusters.
Controls + monitors the building (chillers, fans, fire). Separate from DCIM. Honeywell, Johnson Controls dominate.
Retrofit of an existing building. 12-18 month timeline.
A single shell, typically 20-100 MW, with full mechanical and electrical infrastructure.
Pre-fabricated metal enclosure carrying high-current bus bars overhead. Common for high-density rack rows.
Synonym for rack, sometimes implying a more enclosed unit with side panels and doors.
Multiple buildings on one site, sharing power feed and water. Stargate, Hyperion, Project Rainier are campuses.
Capital expenditure — the upfront cost to build the facility and buy the equipment.
Building where multiple network carriers meet. Also called Internet Exchange (IX). Equinix is the largest operator.
Heat exchanger + pumps that isolate the dirty rack-loop water from the clean facility-loop water.
Wafer-scale chip — 46,225 mm², 4 trillion transistors, 900,000 cores. Niche but extraordinary for inference.
Snapshot of model weights + optimizer state, saved periodically during training to survive failures.
Centralized chillers + cooling towers + pumps producing chilled water for the building.
Industrial refrigeration unit producing chilled water for the facility loop.
Multi-stage switching topology providing non-blocking bandwidth between any two endpoints.
Metal block with internal channels that bolts onto a chip; coolant flows through it.
Operator that rents rack space + power + cooling + cross-connects. Doesn't own the IT. Equinix, Digital Realty.
Physical barrier preventing hot/cold air mixing. Improves PUE.
Evaporative heat-rejection device. Uses water; common in moderate climates.
Optical engine integrated next to the switch ASIC. Eliminates pluggable transceivers.
Refrigerant-based air cooling unit, typical for legacy racks under 30 kW.
Chilled-water based air cooler — more efficient than CRAC at scale.
Physical fiber or copper link connecting two tenants in a carrier hotel.
NVIDIA's parallel computing platform and API. The software moat under all NVIDIA's AI dominance.
NVIDIA's GPU-accelerated library of deep neural network primitives.
kg CO2 per kWh of IT energy. Driven by local grid mix.
Short copper cables (≤3m) for in-rack connections. Cheap, low-latency, no optics.
The 'white space' room containing IT racks. 1-10 MW typically.
Large unified storage for raw data of various types, often in object storage.
Replicate the model on every GPU; split the data batch across them. The simplest form of parallelism.
NVIDIA's GPU telemetry and management daemon. Source of all GPU monitoring metrics.
Software tracking every rack, U slot, cable, asset. Without DCIM you can't answer 'where is server X?'.
PyTorch async distributed checkpointing — overlaps checkpoint writes with compute.
Dominant supplier of HPC-grade Lustre appliances. Newer Infinia targets AI workloads.
Microsoft's training framework. ZeRO partitioning, pipeline parallelism, mixed precision.
Coolant return temp minus supply temp. Higher ΔT = same heat removed at less flow.
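The relation behind this (standard heat-transfer arithmetic, not specific to any vendor): heat removed equals mass flow times specific heat times temperature rise, so doubling ΔT halves the flow needed for the same load.

$$ Q = \dot{m}\, c_p\, \Delta T \;\;\Rightarrow\;\; \dot{m} = \frac{Q}{c_p\, \Delta T}, \qquad c_p \approx 4186\ \mathrm{J/(kg\cdot K)}\ \text{for water} $$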
Open-source training platform with hyperparameter tuning. Acquired by HPE in 2021.
NVIDIA reference systems. DGX = NVIDIA-built and sold; HGX = baseboard for OEM-built servers.
Non-conductive coolant for immersion. Mineral oil, GRC ElectroSafe, 3M Novec, Engineered Fluids ElectroCool.
Spinning flywheel + diesel engine combined. Long autonomy, no battery degradation.
Coolant flows through cold plates touching the chip. Required above ~70 kW/rack.
Air-cooled radiator for facility loop. Uses no water; trades higher PUE for zero WUE.
Spread flows across multiple equal-cost paths. Standard in CLOS fabrics.
Small (1-5 MW) facility near population centers for low-latency inference. Different design from training megacampuses.
EU Energy Efficiency Directive 2023/1791 — mandatory annual energy + water reporting for DCs >500kW.
EPA program rating data center energy performance. Uncommon at hyperscale.
Reed-Solomon-style data protection. Fewer raw bytes than mirroring; cheaper than RAID-6 at scale.
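Illustrative overhead arithmetic for a hypothetical 8+2 layout (8 data shards, 2 parity shards — exact geometry varies by system):

$$ \text{raw-to-usable ratio} = \frac{8+2}{8} = 1.25\times \quad\text{vs.}\quad 3\times\ \text{for triple replication} $$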
Universal LAN standard. AI uses 400/800 GbE in modern fabrics, often with RoCE for RDMA.
Voluntary EU framework for DC energy efficiency. Aligned with the Energy Efficiency Directive.
Distribute MoE experts across GPUs. Adds an all-to-all communication step.
Common CLOS variant; aggregate link bandwidth grows ('fattens') toward the root so full bisection bandwidth is preserved end to end.
Floating-point operation. Singular: one multiply, add, or similar (a fused multiply-add is usually counted as two).
Compute throughput unit. Always check the precision (FP32, BF16, FP8, FP4) — they differ by 4-8×.
Number formats with different precision/range. Lower precision = more throughput, slightly less accuracy.
Cooling using outside air or water without mechanical refrigeration. 1000s of hours/year in cold climates.
PyTorch native variant of ZeRO Stage 3. Shards weights + grads + optimizer.
All-or-nothing scheduling: a job either gets all N GPUs or waits. Required for distributed training.
NVIDIA Blackwell rack-scale system: 72 B200 GPUs + 36 Grace CPUs, ~120 kW liquid-cooled.
Drilling for hot rock to generate firm clean power. Fervo Energy is Google's partner in Nevada.
IBM's parallel filesystem. Mature, widely deployed in enterprise HPC.
Coolant flow rate. A B200 cold plate needs ~0.4 GPM (1.5 L/min).
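Sanity check on the ~0.4 GPM figure, assuming a 1,000 W chip and roughly a 10 K coolant rise (illustrative values, using the Q = ṁ·c_p·ΔT relation above):

$$ \dot{m} = \frac{1000\ \mathrm{W}}{4186\ \mathrm{J/(kg\cdot K)} \times 10\ \mathrm{K}} \approx 0.024\ \mathrm{kg/s} \approx 1.4\ \mathrm{L/min} \approx 0.38\ \mathrm{GPM} $$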
Massively parallel processor originally for graphics, now the backbone of AI training and inference.
NVIDIA Helm-installable bundle: device plugin, drivers, MIG manager, monitoring. K8s must-have.
GPU memory talks directly to network card without going through host RAM. Reduces latency 5-10×.
NVIDIA's ARM-based CPU, paired with Blackwell GPUs in GB200.
All GPUs share their gradient updates after each training step. The dominant network workload.
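A minimal sketch of what that sharing looks like in code — illustrative only; production frameworks (PyTorch DDP, DeepSpeed, Megatron) fuse and overlap these all-reduces with the backward pass rather than looping per parameter:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all ranks after backward().

    Assumes torch.distributed has already been initialized (e.g. with the NCCL backend).
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across every GPU in the job...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so every rank ends up with the mean gradient.
            param.grad /= world_size
```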
The mechanical/electrical equipment rooms (chillers, UPS, generators).
New build on undeveloped land. 18-36 month timeline typical.
Gigawatt = 1 billion watts. Approximately one nuclear reactor. Stargate Phase 1 ≈ 1.2 GW.
NVIDIA flagship GPUs (Hopper / Hopper-refresh / Blackwell). H100 launched 2022; B200 in 2025.
Stacked DRAM packaged next to the GPU die. H100 = HBM3 (3.35 TB/s); B200 = HBM3e (8 TB/s).
Capturing waste heat for district heating, agriculture, or process loads. Common in Northern Europe.
Layout where racks face each other so cold air enters one side and hot exhaust exits the other.
Operates millions of servers across global facilities for own use. Microsoft, Google, AWS, Meta.
Servers submerged in dielectric fluid. Single-phase or two-phase. Niche.
High-speed lossless fabric, dominant for AI training. NDR = 400 Gbps, XDR = 800 Gbps. NVIDIA-owned via Mellanox.
Utility waiting list to add new load. In tier-1 markets: 3-5+ years.
International standard for DC facility infrastructure. Becoming the EU reference.
Cloud-native container orchestrator. Needs gang scheduling extensions (Kueue, Volcano) for AI training.
Kubernetes-native job queueing. Newer than Volcano, simpler model.
Apparent power. Real power (kW) ÷ power factor. UPS and generators are sized in kVA.
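Worked example with illustrative numbers: a hall drawing 900 kW of real power at a 0.95 power factor needs at least

$$ \mathrm{kVA} = \frac{\mathrm{kW}}{\mathrm{PF}} = \frac{900}{0.95} \approx 947\ \mathrm{kVA} $$

of UPS capacity.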
Kilowatt = 1,000 watts. Standard unit for rack-level power. AI rack: 50-130 kW; traditional: 5-10 kW.
USGBC green building rating. LEED-Platinum data centers exist but aren't common.
Transceivers without DSP. Lower power, lower latency. Emerging in 2024-2026.
Open-source parallel filesystem dominating HPC. Used by national labs and many AI labs.
Microsoft's custom AI accelerator, deployed alongside NVIDIA in Azure.
Pipe assembly distributing coolant from CDU to cold plates. Quick-disconnects on each branch.
Lustre component handling file metadata. Bottleneck if not scaled out.
The room where carriers physically interconnect their networks.
NVIDIA's reference framework for training large transformers. Tensor + pipeline parallelism.
Fraction of theoretical peak FLOPS your training run actually achieves. 30-55% is typical.
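Spelled out (the exact accounting of model FLOPs varies by team; the numbers below are illustrative):

$$ \mathrm{MFU} = \frac{\text{model FLOPs executed per second}}{N_{\mathrm{GPU}} \times \text{peak FLOPS per GPU}}, \qquad \text{e.g. } \frac{400\ \mathrm{TFLOPS}}{1000\ \mathrm{TFLOPS}} = 40\%\ \text{per GPU} $$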
AMD Instinct accelerator. 192 GB HBM3 at 5.3 TB/s. Competitive hardware vs H100.
Local generation + storage + load that can island from the utility. Increasingly used at AI campuses.
Hardware partitioning of one A100/H100 into up to 7 isolated slices. Critical for multi-tenant inference.
Open-source S3-compatible object storage. Common for on-prem AI clusters.
Sparse architecture: only some 'experts' (sub-networks) activate per token. Mistral, GPT-4, Gemini use it.
Written, peer-reviewed step-by-step plan for any change in a critical facility. The currency of safe operations.
Aggregation switches placed per row (middle- or end-of-row) rather than in every rack.
HPC standard for inter-node communication. NCCL has largely replaced MPI for AI.
Meta's custom AI silicon, used in production for ranking/recommendation alongside GPUs.
Megawatt = 1 million watts. Building/campus scale. A 100 MW campus is mid-tier modern AI.
GPU-to-GPU collective communication library. The performance backbone of distributed training.
NVIDIA's full-stack framework for LLM training, fine-tuning, inference deployment.
GPU-as-a-service company built around AI compute. CoreWeave, Lambda, Crusoe, Nscale.
Annual carbon emissions reduced to zero (after offsets/credits).
Containerized inference deployment. Standardizes how models are packaged and served.
NVIDIA proprietary GPU interconnect. NVLink 5 = 1.8 TB/s bidirectional per GPU on Blackwell.
PCIe-attached SSD protocol. Fast (14 GB/s/drive) and now the universal data center standard.
Access remote NVMe drives over network. Flavors: NVMe/RDMA (lowest latency) or NVMe/TCP (commodity).
Switch fabric for NVLink. Connects up to 72 GPUs into one memory domain (NVL72).
S3-style key-value blob storage. Cheap, throughput-oriented, eventual consistency. Holds datasets + model registry.
Open hardware specs initiated by Facebook (2011). Used by Meta, Microsoft, Google. Annual Global Summit.
Open rack/server specification originated at LinkedIn. Smaller scope than OCP.
Custom controller managing complex stateful apps. NVIDIA GPU Operator is the standard for K8s + GPU.
Operating expenditure — the recurring cost to run the facility (power, maintenance, staff).
Lustre data nodes. Each OST is a logical storage unit; OSS hosts many OSTs.
Total downstream port bandwidth ÷ upstream bandwidth. AI fabrics aim for 1:1 (non-blocking).
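Illustrative arithmetic for a hypothetical leaf switch: 48 × 400 Gbps down to servers against 8 × 800 Gbps up to the spine gives

$$ \frac{48 \times 400}{8 \times 800} = \frac{19.2\ \mathrm{Tbps}}{6.4\ \mathrm{Tbps}} = 3{:}1\ \text{oversubscribed; a non-blocking fabric keeps this at } 1{:}1. $$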
Send each packet of a flow down a different path. An Ultra Ethernet feature; classic Ethernet pins a flow to one path to avoid reordering.
Rack-level power strip on steroids: monitored, often 415V three-phase, feeding individual server PSUs.
Peta = 10^15, Exa = 10^18. NVL72 = 1.4 EFLOPS at FP4.
Split the model's layers across GPUs so each GPU runs one stage; typically spans nodes.
The space below raised floors used as a cold-air supply duct in legacy DCs.
Group of 10-30 racks sharing a CDU and aggregation switches.
File system API standard. Lustre, GPFS, WekaFS provide POSIX semantics on parallel storage.
kW per rack. The single most important number in modern DC design — drives cooling, networking, and even building dimensions.
Ratio of real to apparent power. Modern AI servers run ~0.95-0.99. Capacitor banks correct lagging PF.
Long-term contract to buy electricity at a fixed price, often from a specific renewable project.
Length of a power purchase agreement, typically 10-20 years for large industrial loads.
The box inside a server that converts AC mains to DC voltages the motherboard uses.
Total facility power ÷ IT power. 1.0 = perfect, 1.10 = hyperscale, 1.5+ = legacy.
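The ratio with illustrative numbers:

$$ \mathrm{PUE} = \frac{\text{total facility power}}{\text{IT power}}, \qquad \text{e.g. } \frac{11.0\ \mathrm{MW}}{10.0\ \mathrm{MW}} = 1.10 $$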
Pluggable optical transceiver form factors. QSFP-DD common at 400/800G; OSFP at 800G+.
Coupling that closes off both halves of a fluid line when separated. Allows hot-swap of components.
Standard 42U or 48U cabinet for IT equipment.
Each server's nth GPU port goes to the nth rail (separate network). Reduces tail latency for collective operations.
Lets one machine read/write another machine's memory without involving the CPU. Sub-microsecond latency.
Liquid-cooled radiator on the back of an air-cooled rack. Bridges air → liquid.
Certificate proving 1 MWh of renewable generation. Tradable; basis of most 'renewable' claims.
N = exactly enough; N+1 = one extra; 2N = full duplicate path. Tier IV requires 2N.
Rack-by-rack or cage-by-cage leasing. Equinix's primary business model.
Brings InfiniBand-like RDMA semantics to standard Ethernet.
AMD's open-source equivalent to CUDA. Closing the gap with NVIDIA but still trails.
GPU fractioning + dynamic scheduling for K8s. Acquired by NVIDIA in 2024 for $700M.
Standard procedure for handling a specific alarm or incident type.
Amazon Simple Storage Service. The de-facto API for object storage; many compatible alternatives exist.
Target temperature the BMS maintains. Modern AI halls run supply air at 22-27°C; warmer setpoints improve PUE.
HPC workload scheduler from LLNL. Native gang scheduling. Used by xAI, Meta research, every HPC center.
Network card with onboard CPU/accelerators offloading networking, storage, security from host.
New generation of nuclear reactors (50-300 MW each). Google + Kairos Power signed a deal in 2024.
Skipping zero or near-zero values during computation. NVIDIA H100/B200 quote 2× FLOPS with structured sparsity.
NVIDIA's Ethernet platform optimized for AI fabrics. Competes with their own InfiniBand offering.
Two-tier CLOS. Leaf = top-of-rack switches; spine = aggregation layer above.
Sub-cycle transfer between two synchronized AC sources. Used downstream of dual UPS systems.
InfiniBand fabric administrative domain, managed by a single subnet manager; every endpoint within it is addressed and routed by its LID.
InfiniBand control-plane daemon assigning LIDs and computing routes. Critical: if it goes down, the fabric cannot reconfigure.
Where utility transmission voltage is stepped down to medium voltage for the data center.
High-current breakers and protective relays at the medium-voltage level inside the facility.
99th-percentile round-trip time. Bigger driver of training throughput than mean latency.
Contract requiring the buyer to pay even if they don't take delivery. Standard for utility interconnect agreements.
Capex + opex over the asset life. Often expressed as $/GPU-year.
Max sustained power dissipation. H100 = 700W; B200 = 1000W.
Specialized matrix-multiply units inside NVIDIA GPUs since Volta. Where the FLOPS come from.
Split a single matmul across multiple GPUs. Within a node typically; needs fast NVLink.
Hugging Face's inference server. Strong production-grade alternative to vLLM.
Uptime Institute reliability classification. Tier IV = 2N redundant + concurrently maintainable + fault-tolerant.
The physical/logical arrangement of switches and links. CLOS, fat-tree, dragonfly, torus.
The switch at the top of each rack that connects all servers in that rack to the fabric.
PyTorch fault-tolerant training — handles node failures dynamically without restarting the whole job.
Google's custom AI accelerator. Only available inside Google Cloud / internally.
AWS custom AI training chip. Trainium2 (2024) powers Project Rainier with Anthropic.
Steps voltage from medium (13.8/34.5 kV) to distribution (480 V US, 400 V EU).
Both a JIT compiler (OpenAI Triton) and an inference server (NVIDIA Triton). Confusingly named.
NVIDIA's open inference server. Multi-framework, multi-model, dynamic batching.
Rack unit = 44.45 mm (1.75 inches). A '1U server' is one slot tall.
Open multi-vendor consortium effort (AMD, Broadcom, Cisco, Meta, Microsoft, Oracle...) targeting AI workloads at 1.6T. Production 2025-2026.
Battery or flywheel system that bridges the <10s gap between grid loss and generator startup.
Accounting depreciation period. Microsoft, Meta extended to 5.5-6 years on AI servers in 2022-2024.
DASE (Disaggregated Shared Everything) all-flash architecture. CoreWeave is a notable customer.
NVIDIA's software GPU virtualization. Different from MIG: time-sliced, lower isolation.
Open-source LLM inference server. Continuous batching, paged attention. Industry standard.
K8s scheduler add-on built specifically for AI/HPC workloads. Adds gang scheduling, fair share, queues.
GPU memory. For AI accelerators it's HBM specifically.
Replenishing more water than consumed. Microsoft's 2030 commitment.
Software-defined NVMe-only parallel FS. Strong AI mindshare; deployed at Stability AI, Cohere.
Flexible cable from busway tap to rack PDU. Color-coded per phase.
The IT rack area. Distinct from 'gray space' (mechanical rooms).
Whole data hall or building leased to one customer. Common for hyperscalers leasing from Digital Realty, CyrusOne.
Liters of water per kWh of IT energy. Evap ≈ 1.8 L/kWh, dry cooling ≈ 0.
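The ratio with illustrative numbers: a 10 MW IT load at 1.8 L/kWh evaporates roughly

$$ 10{,}000\ \mathrm{kW} \times 1\ \mathrm{h} \times 1.8\ \mathrm{L/kWh} = 18{,}000\ \mathrm{L\ per\ hour} \approx 430\ \mathrm{m^3/day}. $$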
DeepSpeed memory optimization. Shards optimizer/gradients/weights across GPUs to fit larger models.
Every term here is covered in the 12-lesson curriculum. Looking for hands-on training? See the curated courses page.