Glossary
205 terms across power, cooling, networking, compute, storage, software, facility, sustainability, economics, and standards. Searchable. Cross-linked to the curriculum.
Cost per megawatt of capacity. $8-12M for shell only; $30-50M with modern GPUs.
Stronger than annual matching: every hour of consumption is matched with carbon-free energy on the local grid. Google's 2030 target.
Switches dynamically reroute around congested paths. InfiniBand has it natively; Ultra Ethernet adds it.
Pre-cool intake air by evaporating water before it enters dry coolers. Boosts free-cooling hours.
Buy enough renewable certificates to match annual consumption. Weaker than 24/7 matching — the gap when solar drops off at sunset is still covered by gas peakers.
TIA's data center standard, with its own reliability ratings (Rated-1 through Rated-4), distinct from Uptime Institute Tiers.
Fiber pre-terminated with active electronics on each end. Up to ~30m.
Difference between coolant return temp and outside wet-bulb. A lower approach requires a larger cooling tower.
Industry envelope for allowable inlet temperatures. W4 = supply water up to 45°C for liquid cooling.
Mission Critical Facilities subcommittee. Defines thermal envelopes (W1-W5) for liquid cooling.
Chip designed for one purpose. AI ASICs (TPU, Trainium, MTIA, Maia) compete with general-purpose GPUs.
Switches load between utility and generator power automatically when one source fails.
Switches load between utility and generator power automatically when utility fails.
NVIDIA's cluster management software (formerly Bright Cluster Manager). Provisioning + monitoring + workload.
Lithium-ion battery module. Modern UPS racks use BBUs in N+1 configuration.
Generation owned by the data center, not the utility. Bypasses grid interconnect queues.
Total bandwidth across the worst-case partition of the network. Key metric for fabric quality.
NVIDIA's DPU family. BlueField-3 = ARM cores + 400G ports. Powers offload in modern AI clusters.
Controls + monitors the building (chillers, fans, fire). Separate from DCIM. Honeywell, Johnson Controls dominate.
Retrofit of an existing building. 12-18 month timeline.
A single shell, typically 20-100 MW, with full mechanical and electrical infrastructure.
Pre-fabricated metal enclosure carrying high-current bus bars overhead. Common for high-density rack rows.
Synonym for rack, sometimes implying a more enclosed unit with side panels and doors.
Multiple buildings on one site, sharing power feed and water. Stargate, Hyperion, Project Rainier are campuses.
Capital expenditure — the upfront cost to build the facility and buy the equipment.
Building where multiple network carriers meet. Also called Internet Exchange (IX). Equinix is the largest operator.
Heat exchanger + pumps that isolate the dirty rack-loop water from the clean facility-loop water.
Wafer-scale chip — 46,225 mm², 4 trillion transistors, 900,000 cores. Niche but extraordinary for inference.
Snapshot of model weights + optimizer state, saved periodically during training to survive failures.
Centralized chillers + cooling towers + pumps producing chilled water for the building.
Industrial refrigeration unit producing chilled water for the facility loop.
Multi-stage switching topology providing non-blocking bandwidth between any two endpoints.
Metal block with internal channels that bolts onto a chip; coolant flows through it.
Operator that rents rack space + power + cooling + cross-connects. Doesn't own the IT. Equinix, Digital Realty.
Physical barrier preventing hot/cold air mixing. Improves PUE.
Evaporative heat-rejection device. Uses water; common in moderate climates.
Optical engine integrated next to the switch ASIC. Eliminates pluggable transceivers.
Refrigerant-based air cooling unit, typical for legacy racks under 30 kW.
Chilled-water based air cooler — more efficient than CRAC at scale.
Physical fiber or copper link connecting two tenants in a carrier hotel.
NVIDIA's parallel computing platform and API. The software moat under all NVIDIA's AI dominance.
NVIDIA's GPU-accelerated library of deep neural network primitives.
kg CO2 per kWh of IT energy. Driven by local grid mix.
Short copper cables (≤3m) for in-rack connections. Cheap, low-latency, no optics.
The 'white space' room containing IT racks. 1-10 MW typically.
Large unified storage for raw data of various types, often in object storage.
Replicate the model on every GPU; split the data batch across them. The simplest form of parallelism.
NVIDIA's GPU telemetry and management daemon. Source of all GPU monitoring metrics.
Software tracking every rack, U slot, cable, asset. Without DCIM you can't answer 'where is server X?'.
PyTorch async distributed checkpointing — overlaps checkpoint writes with compute.
Dominant supplier of HPC-grade Lustre appliances. Newer Infinia targets AI workloads.
Microsoft's training framework. ZeRO partitioning, pipeline parallelism, mixed precision.
Coolant return temp minus supply temp. Higher ΔT = same heat removed at less flow.
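The relation behind this (standard heat-transfer arithmetic, not specific to any vendor): heat removed equals mass flow times specific heat times temperature rise, so doubling ΔT halves the flow needed for the same load.

$$ Q = \dot{m}\, c_p\, \Delta T \;\;\Rightarrow\;\; \dot{m} = \frac{Q}{c_p\, \Delta T}, \qquad c_p \approx 4186\ \mathrm{J/(kg\cdot K)}\ \text{for water} $$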
Open-source training platform with hyperparameter tuning. Acquired by HPE in 2021.
NVIDIA reference systems. DGX = NVIDIA-built and sold; HGX = baseboard for OEM-built servers.
Non-conductive coolant for immersion. Mineral oil, GRC ElectroSafe, 3M Novec, Engineered Fluids ElectroCool.
Spinning flywheel + diesel engine combined. Long autonomy, no battery degradation.
Coolant flows through cold plates touching the chip. Required above ~70 kW/rack.
Air-cooled radiator for facility loop. Uses no water; trades higher PUE for zero WUE.
Spread flows across multiple equal-cost paths. Standard in CLOS fabrics.
Small (1-5 MW) facility near population centers for low-latency inference. Different design from training megacampuses.
EU Energy Efficiency Directive 2023/1791 — mandatory annual energy + water reporting for DCs >500kW.
EPA program rating data center energy performance. Uncommon at hyperscale.
Reed-Solomon-style data protection. Fewer raw bytes than mirroring; cheaper than RAID-6 at scale.
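Illustrative overhead arithmetic for a hypothetical 8+2 layout (8 data shards, 2 parity shards — exact geometry varies by system):

$$ \text{raw-to-usable ratio} = \frac{8+2}{8} = 1.25\times \quad\text{vs.}\quad 3\times\ \text{for triple replication} $$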
Universal LAN standard. AI uses 400/800 GbE in modern fabrics, often with RoCE for RDMA.
Voluntary EU framework for DC energy efficiency. Aligned with the Energy Efficiency Directive.
Distribute MoE experts across GPUs. Adds an all-to-all communication step.
Common CLOS variant; aggregate link bandwidth grows ('fattens') toward the root so full bisection bandwidth is preserved end to end.
Floating-point operation. Singular: one multiply, add, or similar (a fused multiply-add is usually counted as two).
Compute throughput unit. Always check the precision (FP32, BF16, FP8, FP4) — they differ by 4-8×.
Number formats with different precision/range. Lower precision = more throughput, slightly less accuracy.
Cooling using outside air or water without mechanical refrigeration. 1000s of hours/year in cold climates.
PyTorch native variant of ZeRO Stage 3. Shards weights + grads + optimizer.
All-or-nothing scheduling: a job either gets all N GPUs or waits. Required for distributed training.
NVIDIA Blackwell rack-scale system: 72 B200 GPUs + 36 Grace CPUs, ~120 kW liquid-cooled.
Drilling for hot rock to generate firm clean power. Fervo Energy is Google's partner in Nevada.
IBM's parallel filesystem. Mature, widely deployed in enterprise HPC.
Coolant flow rate. A B200 cold plate needs ~0.4 GPM (1.5 L/min).
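Sanity check on the ~0.4 GPM figure, assuming a 1,000 W chip and roughly a 10 K coolant rise (illustrative values, using the Q = ṁ·c_p·ΔT relation above):

$$ \dot{m} = \frac{1000\ \mathrm{W}}{4186\ \mathrm{J/(kg\cdot K)} \times 10\ \mathrm{K}} \approx 0.024\ \mathrm{kg/s} \approx 1.4\ \mathrm{L/min} \approx 0.38\ \mathrm{GPM} $$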
Massively parallel processor originally for graphics, now the backbone of AI training and inference.
NVIDIA Helm-installable bundle: device plugin, drivers, MIG manager, monitoring. K8s must-have.
GPU memory talks directly to network card without going through host RAM. Reduces latency 5-10×.
NVIDIA's ARM-based CPU, paired with Blackwell GPUs in GB200.
All GPUs share their gradient updates after each training step. The dominant network workload.
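A minimal sketch of what that sharing looks like in code — illustrative only; production frameworks (PyTorch DDP, DeepSpeed, Megatron) fuse and overlap these all-reduces with the backward pass rather than looping per parameter:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all ranks after backward().

    Assumes torch.distributed has already been initialized (e.g. with the NCCL backend).
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this gradient tensor across every GPU in the job...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so every rank ends up with the mean gradient.
            param.grad /= world_size
```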
The mechanical/electrical equipment rooms (chillers, UPS, generators).
New build on undeveloped land. 18-36 month timeline typical.
Gigawatt = 1 billion watts. Approximately one nuclear reactor. Stargate Phase 1 ≈ 1.2 GW.
NVIDIA flagship GPUs (Hopper / Hopper-refresh / Blackwell). H100 launched 2022; B200 in 2025.
Stacked DRAM packaged next to the GPU die. H100 = HBM3 (3.35 TB/s); B200 = HBM3e (8 TB/s).
Capturing waste heat for district heating, agriculture, or process loads. Common in Northern Europe.
Layout where racks face each other so cold air enters one side and hot exhaust exits the other.
Operates millions of servers across global facilities for own use. Microsoft, Google, AWS, Meta.
Servers submerged in dielectric fluid. Single-phase or two-phase. Niche.
High-speed lossless fabric, dominant for AI training. NDR = 400 Gbps, XDR = 800 Gbps. NVIDIA-owned via Mellanox.
Utility waiting list to add new load. In tier-1 markets: 3-5+ years.
International standard for DC facility infrastructure. Becoming the EU reference.
Cloud-native container orchestrator. Needs gang scheduling extensions (Kueue, Volcano) for AI training.
Kubernetes-native job queueing. Newer than Volcano, simpler model.
Apparent power. Real power (kW) ÷ power factor. UPS and generators are sized in kVA.
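Worked example with illustrative numbers: a hall drawing 900 kW of real power at a 0.95 power factor needs at least

$$ \mathrm{kVA} = \frac{\mathrm{kW}}{\mathrm{PF}} = \frac{900}{0.95} \approx 947\ \mathrm{kVA} $$

of UPS capacity.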
Kilowatt = 1,000 watts. Standard unit for rack-level power. AI rack: 50-130 kW; traditional: 5-10 kW.
USGBC green building rating. LEED-Platinum data centers exist but aren't common.
Transceivers without DSP. Lower power, lower latency. Emerging in 2024-2026.
Open-source parallel filesystem dominating HPC. Used by national labs and many AI labs.
Microsoft's custom AI accelerator, deployed alongside NVIDIA in Azure.
Pipe assembly distributing coolant from CDU to cold plates. Quick-disconnects on each branch.
Lustre component handling file metadata. Bottleneck if not scaled out.
The room where carriers physically interconnect their networks.
NVIDIA's reference framework for training large transformers. Tensor + pipeline parallelism.
Fraction of theoretical peak FLOPS your training run actually achieves. 30-55% is typical.
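Spelled out (the exact accounting of model FLOPs varies by team; the numbers below are illustrative):

$$ \mathrm{MFU} = \frac{\text{model FLOPs executed per second}}{N_{\mathrm{GPU}} \times \text{peak FLOPS per GPU}}, \qquad \text{e.g. } \frac{400\ \mathrm{TFLOPS}}{1000\ \mathrm{TFLOPS}} = 40\%\ \text{per GPU} $$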
AMD Instinct accelerator. 192 GB HBM3 at 5.3 TB/s. Competitive hardware vs H100.
Local generation + storage + load that can island from the utility. Increasingly used at AI campuses.
Hardware partitioning of one A100/H100 into up to 7 isolated slices. Critical for multi-tenant inference.
Open-source S3-compatible object storage. Common for on-prem AI clusters.
Sparse architecture: only some 'experts' (sub-networks) activate per token. Mistral, GPT-4, Gemini use it.
Written, peer-reviewed step-by-step plan for any change in a critical facility. The currency of safe operations.
Aggregation switches placed per row (middle- or end-of-row) rather than in every rack.
HPC standard for inter-node communication. NCCL has largely replaced MPI for AI.
Meta's custom AI silicon, used in production for ranking/recommendation alongside GPUs.
Megawatt = 1 million watts. Building/campus scale. A 100 MW campus is mid-tier modern AI.
GPU-to-GPU collective communication library. The performance backbone of distributed training.
NVIDIA's full-stack framework for LLM training, fine-tuning, inference deployment.
GPU-as-a-service company built around AI compute. CoreWeave, Lambda, Crusoe, Nscale.
Annual carbon emissions reduced to zero (after offsets/credits).
Containerized inference deployment. Standardizes how models are packaged and served.
NVIDIA proprietary GPU interconnect. NVLink 5 = 1.8 TB/s bidirectional per GPU on Blackwell.
PCIe-attached SSD protocol. Fast (14 GB/s/drive) and now the universal data center standard.
Access remote NVMe drives over network. Flavors: NVMe/RDMA (lowest latency) or NVMe/TCP (commodity).
Switch fabric for NVLink. Connects up to 72 GPUs into one memory domain (NVL72).
S3-style key-value blob storage. Cheap, throughput-oriented, eventual consistency. Holds datasets + model registry.
Open hardware specs initiated by Facebook (2011). Used by Meta, Microsoft, Google. Annual Global Summit.
Open rack/server specification originated at LinkedIn. Smaller scope than OCP.
Custom controller managing complex stateful apps. NVIDIA GPU Operator is the standard for K8s + GPU.
Operating expenditure — the recurring cost to run the facility (power, maintenance, staff).
Lustre data nodes. Each OST is a logical storage unit; OSS hosts many OSTs.
Total downstream port bandwidth ÷ upstream bandwidth. AI fabrics aim for 1:1 (non-blocking).
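Illustrative arithmetic for a hypothetical leaf switch: 48 × 400 Gbps down to servers against 8 × 800 Gbps up to the spine gives

$$ \frac{48 \times 400}{8 \times 800} = \frac{19.2\ \mathrm{Tbps}}{6.4\ \mathrm{Tbps}} = 3{:}1\ \text{oversubscribed; a non-blocking fabric keeps this at } 1{:}1. $$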
Send each packet of a flow down a different path. An Ultra Ethernet feature; classic Ethernet pins a flow to one path to avoid reordering.
Rack-level power strip on steroids: monitored, often 415V three-phase, feeding individual server PSUs.
Peta = 10^15, Exa = 10^18. NVL72 = 1.4 EFLOPS at FP4.
Split the model's layers across GPUs so each GPU runs one stage; typically spans nodes.
The space below raised floors used as a cold-air supply duct in legacy DCs.
Group of 10-30 racks sharing a CDU and aggregation switches.
File system API standard. Lustre, GPFS, WekaFS provide POSIX semantics on parallel storage.
kW per rack. The single most important number in modern DC design — drives cooling, networking, and even building dimensions.
Ratio of real to apparent power. Modern AI servers run ~0.95-0.99. Capacitor banks correct lagging PF.
Long-term contract to buy electricity at a fixed price, often from a specific renewable project.
Length of a power purchase agreement, typically 10-20 years for large industrial loads.
The box inside a server that converts AC mains to DC voltages the motherboard uses.
Total facility power ÷ IT power. 1.0 = perfect, 1.10 = hyperscale, 1.5+ = legacy.
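The ratio with illustrative numbers:

$$ \mathrm{PUE} = \frac{\text{total facility power}}{\text{IT power}}, \qquad \text{e.g. } \frac{11.0\ \mathrm{MW}}{10.0\ \mathrm{MW}} = 1.10 $$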
Pluggable optical transceiver form factors. QSFP-DD common at 400/800G; OSFP at 800G+.
Coupling that closes off both halves of a fluid line when separated. Allows hot-swap of components.
Standard 42U or 48U cabinet for IT equipment.
Each server's nth GPU port goes to the nth rail (separate network). Reduces tail latency for collective operations.
Lets one machine read/write another machine's memory without involving the CPU. Sub-microsecond latency.
Liquid-cooled radiator on the back of an air-cooled rack. Bridges air → liquid.
Certificate proving 1 MWh of renewable generation. Tradable; basis of most 'renewable' claims.
N = exactly enough; N+1 = one extra; 2N = full duplicate path. Tier IV requires 2N.
Rack-by-rack or cage-by-cage leasing. Equinix's primary business model.
Brings InfiniBand-like RDMA semantics to standard Ethernet.
AMD's open-source equivalent to CUDA. Closing the gap with NVIDIA but still trails.
GPU fractioning + dynamic scheduling for K8s. Acquired by NVIDIA in 2024 for $700M.
Standard procedure for handling a specific alarm or incident type.
Amazon Simple Storage Service. The de-facto API for object storage; many compatible alternatives exist.
Target temperature the BMS maintains. Modern AI halls run supply air at 22-27°C; warmer setpoints improve PUE.
HPC workload scheduler from LLNL. Native gang scheduling. Used by xAI, Meta research, every HPC center.
Network card with onboard CPU/accelerators offloading networking, storage, security from host.
New generation of nuclear reactors (50-300 MW each). Google + Kairos Power signed a deal in 2024.
Skipping zero or near-zero values during computation. NVIDIA H100/B200 quote 2× FLOPS with structured sparsity.
NVIDIA's Ethernet platform optimized for AI fabrics. Competes with their own InfiniBand offering.
Two-tier CLOS. Leaf = top-of-rack switches; spine = aggregation layer above.
Sub-cycle transfer between two synchronized AC sources. Used downstream of dual UPS systems.
InfiniBand fabric administrative domain, managed by a single subnet manager; every endpoint within it is addressed and routed by its LID.
InfiniBand control-plane daemon assigning LIDs and computing routes. Critical: if it goes down, the fabric cannot reconfigure.
Where utility transmission voltage is stepped down to medium voltage for the data center.
High-current breakers and protective relays at the medium-voltage level inside the facility.
99th-percentile round-trip time. Bigger driver of training throughput than mean latency.
Contract requiring the buyer to pay even if they don't take delivery. Standard for utility interconnect agreements.
Capex + opex over the asset life. Often expressed as $/GPU-year.
Max sustained power dissipation. H100 = 700W; B200 = 1000W.
Specialized matrix-multiply units inside NVIDIA GPUs since Volta. Where the FLOPS come from.
Split a single matmul across multiple GPUs. Within a node typically; needs fast NVLink.
Hugging Face's inference server. Strong production-grade alternative to vLLM.
Uptime Institute reliability classification. Tier IV = 2N redundant + concurrently maintainable + fault-tolerant.
The physical/logical arrangement of switches and links. CLOS, fat-tree, dragonfly, torus.
The switch at the top of each rack that connects all servers in that rack to the fabric.
PyTorch fault-tolerant training — handles node failures dynamically without restarting the whole job.
Google's custom AI accelerator. Only available inside Google Cloud / internally.
AWS custom AI training chip. Trainium2 (2024) powers Project Rainier with Anthropic.
Steps voltage from medium (13.8/34.5 kV) to distribution (480 V US, 400 V EU).
Both a JIT compiler (OpenAI Triton) and an inference server (NVIDIA Triton). Confusingly named.
NVIDIA's open inference server. Multi-framework, multi-model, dynamic batching.
Rack unit = 44.45 mm (1.75 inches). A '1U server' is one slot tall.
Open multi-vendor consortium effort (AMD, Broadcom, Cisco, Meta, Microsoft, Oracle...) targeting AI workloads at 1.6T. Production 2025-2026.
Battery or flywheel system that bridges the <10s gap between grid loss and generator startup.
Accounting depreciation period. Microsoft, Meta extended to 5.5-6 years on AI servers in 2022-2024.
DASE (Disaggregated Shared Everything) all-flash architecture. CoreWeave is a notable customer.
NVIDIA's software GPU virtualization. Different from MIG: time-sliced, lower isolation.
Open-source LLM inference server. Continuous batching, paged attention. Industry standard.
K8s scheduler add-on built specifically for AI/HPC workloads. Adds gang scheduling, fair share, queues.
GPU memory. For AI accelerators it's HBM specifically.
Replenishing more water than consumed. Microsoft's 2030 commitment.
Software-defined NVMe-only parallel FS. Strong AI mindshare; deployed at Stability AI, Cohere.
Flexible cable from busway tap to rack PDU. Color-coded per phase.
The IT rack area. Distinct from 'gray space' (mechanical rooms).
Whole data hall or building leased to one customer. Common for hyperscalers leasing from Digital Realty, CyrusOne.
Liters of water per kWh of IT energy. Evap ≈ 1.8 L/kWh, dry cooling ≈ 0.
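The ratio with illustrative numbers: a 10 MW IT load at 1.8 L/kWh evaporates roughly

$$ 10{,}000\ \mathrm{kW} \times 1\ \mathrm{h} \times 1.8\ \mathrm{L/kWh} = 18{,}000\ \mathrm{L\ per\ hour} \approx 430\ \mathrm{m^3/day}. $$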
DeepSpeed memory optimization. Shards optimizer/gradients/weights across GPUs to fit larger models.
Every term here is covered in the 12-lesson curriculum. Looking for hands-on training? See the curated courses page.