Practice
Stop reading. Start designing.
Theory is necessary but not sufficient. Below is an interactive AI data center designer that puts the lessons into your hands. Pick scale, hardware, cooling, redundancy — see the numbers a real design team would see. Below that: cloud labs where you can actually rent GPUs and incident scenarios that walk through real decisions.
🛠️ Data Center Designer
Live calculation — no signup, no save state, just play.
Quick presets
⚙️ Configuration
72 B200 GPUs + 36 Grace CPUs in one liquid-cooled rack. Single NVLink domain. Cost shown is per-GPU equivalent.
Cold plates on chips. Required for B200/GB200. Standard for modern AI.
Hyperscaler favorite. Cool climate = great PUE. Mid grid mix.
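What does "per-GPU equivalent" mean in that first preset? Just the rack rolled down to its 72 GPUs. A minimal sketch, with the rack price and power figures assumed for illustration rather than quoted:

```python
# Sketch: per-GPU-equivalent cost for a GB200 NVL72-style rack.
# All prices and power figures are illustrative assumptions, not vendor quotes.

RACK_PRICE_USD = 3_000_000   # assumed all-in price for one NVL72 rack
GPUS_PER_RACK = 72           # 72 B200 GPUs (plus 36 Grace CPUs) per rack
RACK_POWER_KW = 120          # assumed IT load for one liquid-cooled rack

per_gpu_capex = RACK_PRICE_USD / GPUS_PER_RACK
per_gpu_power_w = RACK_POWER_KW * 1000 / GPUS_PER_RACK

print(f"Per-GPU-equivalent capex: ${per_gpu_capex:,.0f}")
print(f"Per-GPU-equivalent IT power: {per_gpu_power_w:.0f} W")
```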
📊 Live results
Capex breakdown
🚂 Training capability (rough estimate)
Assumes 40% MFU. Actual times vary widely with code, communication overhead, restarts.
Calculation notes: PUE math is simplified (cooling baseline × climate factor). Capex includes silicon at street prices. Opex includes power, ~4% maintenance, and a minimum staffing floor of $2M/yr. Training estimates assume 40% MFU. Real designs require working with mech/electrical engineers — this tool teaches relationships, not blueprints.
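If you want to sanity-check the tool offline, the same relationships fit in a short script. Every constant below (cooling baseline, climate factor, power price, GPU peak FLOPs) is an assumption to replace with your own numbers, and "~4% maintenance" is read here as 4% of capex:

```python
# Rough sketch of the designer's roll-up. All constants are assumptions.

def pue(cooling_baseline=0.25, climate_factor=0.8):
    """Simplified PUE: 1 + cooling overhead scaled by a climate factor."""
    return 1.0 + cooling_baseline * climate_factor

def annual_opex_usd(it_power_mw, capex_usd, power_price_per_kwh=0.07,
                    maintenance_rate=0.04, staff_floor_usd=2_000_000):
    """Power at the computed PUE, ~4% of capex in maintenance, staffing floor."""
    energy_kwh = it_power_mw * 1000 * pue() * 8760
    return (energy_kwh * power_price_per_kwh
            + maintenance_rate * capex_usd
            + staff_floor_usd)

def training_days(model_params, tokens, n_gpus,
                  peak_flops_per_gpu=2.25e15, mfu=0.40):
    """~6 * params * tokens total FLOPs, at an assumed dense low-precision peak."""
    total_flops = 6 * model_params * tokens
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * mfu)
    return seconds / 86400

# Example: 100 MW IT load, $3B capex, 405B params on 15T tokens with 16,384 GPUs
print(f"PUE: {pue():.2f}")
print(f"Annual opex: ${annual_opex_usd(100, 3e9)/1e6:.0f}M")
print(f"Training time: {training_days(405e9, 15e12, 16_384):.0f} days")
```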
☁️ Rent real GPUs by the hour
The fastest way to gain real experience: spin up an actual GPU instance for a few dollars, run a workload, observe what happens. The providers below are where that experimentation happens.
RunPod
~$2.00/hr · Affordable on-demand + spot
Probably the cheapest reliable H100 spot pricing for individuals. Community + secure cloud variants.
Vast.ai
~$1.50/hr · Marketplace (cheapest)
Decentralized GPU marketplace. Lowest prices but variable reliability — perfect for experimentation.
Lambda Cloud
~$2.49/hr · Reliable cloud, AI-focused
Built specifically for ML workloads. Pre-installed CUDA stack. Easier than AWS for newcomers.
CoreWeave
Contract pricing · Production neocloud
Real production-scale clusters with InfiniBand fabrics. For when you've outgrown spot instances.
AWS EC2 P5 (H100)
~$13.10/hr · Hyperscaler grade
Full enterprise stack: VPC, IAM, S3 integration. Expensive but production-realistic.
Google Cloud A3
~$11.06/hr · Hyperscaler grade
H100s on GCP. Tight integration with TPU v5/v6 if you want to compare.
Azure ND H100 v5
~$12.30/hr · Hyperscaler grade
Azure's H100 series. Best for Microsoft-stack shops.
NVIDIA NGC Catalog
Free · Container library
Pre-built containers for every major framework. Use these on top of any provider above.
Hugging Face Spaces (ZeroGPU)
Free / Pro $9/mo · Free for demos
Free shared H200 access for inference demos. Great for trying a model before renting.
Prices are approximate spot/on-demand for an H100 80GB. Always check current pricing — GPU markets shift weekly.
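Before picking a provider, price the whole job rather than staring at the hourly rate. A quick sketch using the rates from the table above (they will already be stale by the time you run this):

```python
# Sketch: total cost of a 6-hour, single-H100 experiment per provider.
# Hourly rates copied from the table above; check current pricing first.

hourly_usd = {
    "Vast.ai": 1.50,
    "RunPod": 2.00,
    "Lambda Cloud": 2.49,
    "Google Cloud A3": 11.06,
    "Azure ND H100 v5": 12.30,
    "AWS EC2 P5": 13.10,
}

JOB_HOURS = 6  # assumed length of your experiment

for provider, rate in sorted(hourly_usd.items(), key=lambda kv: kv[1]):
    print(f"{provider:<18} ${rate:>5.2f}/hr -> ${rate * JOB_HOURS:6.2f} total")
```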
🎯 Your first hands-on exercise (under $5)
- Sign up for RunPod or Vast.ai.
- Spin up a single H100 80GB spot instance (~$2-3/hour).
- SSH in. Run nvidia-smi to confirm the GPU is alive.
- Pull the official NVIDIA PyTorch container: docker pull nvcr.io/nvidia/pytorch:25.04-py3
- Run a 30-minute Llama 3.1 8B fine-tune from Hugging Face.
- Watch GPU util and memory in nvtop. Note your tokens/sec (you'll use it in the sketch below).
- Tear it all down. You've spent ~$3 and just done what most people only read about.
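Bonus: turn that tokens/sec number into an MFU estimate and compare it to the 40% the designer assumes. A minimal sketch; the parameter count and H100 peak below are assumptions to adjust for your exact run:

```python
# Sketch: convert observed fine-tuning throughput into an MFU estimate.
# Peak FLOPs and parameter count are assumptions; adjust to your setup.

PARAMS = 8e9                 # Llama 3.1 8B
H100_PEAK_FLOPS = 0.989e15   # assumed BF16 dense peak for one H100 SXM
N_GPUS = 1

def mfu(tokens_per_sec, params=PARAMS, peak=H100_PEAK_FLOPS, n_gpus=N_GPUS):
    """~6 FLOPs per parameter per token for a forward + backward pass."""
    achieved_flops = 6 * params * tokens_per_sec
    return achieved_flops / (peak * n_gpus)

print(f"MFU at 3,000 tok/s: {mfu(3_000):.1%}")  # example observation
```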
🧰 More hands-on tools
🔧
Geographic PUE Calculator
Pick a city + cooling. Get annual PUE, WUE, CUE, and free-cooling hours from real ASHRAE/grid data. 20 markets covered.
Open the calculator →
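Under the hood, the free-cooling number is just "how many hours a year is outside air cold enough?" A toy sketch with synthetic temperatures standing in for the real ASHRAE bin data the calculator uses:

```python
# Toy sketch of free-cooling hours: count hours below an economizer threshold.
# Real tools use ASHRAE bin/TMY data; the synthetic temperatures are placeholders.
import math
import random

random.seed(0)
ECONOMIZER_THRESHOLD_C = 18  # assumed dry-bulb cutoff for free cooling

# Placeholder hourly dry-bulb temps: seasonal sine wave plus noise.
temps = [10 + 12 * math.sin(2 * math.pi * h / 8760) + random.gauss(0, 4)
         for h in range(8760)]

free_hours = sum(t < ECONOMIZER_THRESHOLD_C for t in temps)
print(f"Free-cooling hours: {free_hours} of 8760 ({free_hours / 8760:.0%})")
```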
⚡
Power Budget Game
"Train Llama-405B in 90 days with 200 MW + $2B." Design a cluster that wins all three constraints.
Play →
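The game boils down to three inequalities: budget, power, schedule. A sketch of the checks it runs, with every per-GPU figure an assumption rather than a quoted spec:

```python
# Sketch of the Power Budget Game's three constraints. Per-GPU figures are assumptions.

def check_design(n_gpus, gpu_price=35_000, gpu_power_kw=1.2, pue=1.2,
                 peak_flops=2e15, mfu=0.40, params=405e9, tokens=15e12,
                 budget_usd=2e9, power_cap_mw=200, deadline_days=90):
    capex = n_gpus * gpu_price                      # silicon only; facility is extra
    power_mw = n_gpus * gpu_power_kw * pue / 1000   # facility power incl. cooling
    days = 6 * params * tokens / (n_gpus * peak_flops * mfu) / 86400
    return {
        "within budget": capex <= budget_usd,
        "within power": power_mw <= power_cap_mw,
        "on schedule": days <= deadline_days,
        "capex_$B": round(capex / 1e9, 2),
        "power_MW": round(power_mw, 1),
        "days": round(days),
    }

print(check_design(n_gpus=32_768))
```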
🚨
Incident Scenarios
3 AM page. Coolant leak, grid loss, spine switch failure. Branching runbooks — see consequences of each decision.
Take a page →
🧮
Capacity Planning Worksheet
Power, cooling, space — three independent constraints. Find the binding one. Minimize stranded capacity.
Open the worksheet →
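The worksheet's core move, finding the binding constraint, is a few lines once the numbers are in front of you. The rack-level figures below are made up for illustration:

```python
# Sketch: which of power, cooling, or space binds first? Figures are assumptions.

RACK_KW = 120  # assumed IT load per rack
site = {
    "power":   {"capacity": 30_000, "per_rack": RACK_KW},  # kW of usable IT power
    "cooling": {"capacity": 25_000, "per_rack": RACK_KW},  # kW of heat rejection
    "space":   {"capacity": 400,    "per_rack": 1},        # rack positions
}

racks_by_constraint = {name: c["capacity"] // c["per_rack"] for name, c in site.items()}
binding = min(racks_by_constraint, key=racks_by_constraint.get)
buildable = racks_by_constraint[binding]

print(racks_by_constraint)  # e.g. {'power': 250, 'cooling': 208, 'space': 400}
print(f"Binding constraint: {binding}; buildable racks: {buildable}")
# Capacity above 'buildable' in the other two dimensions is stranded.
```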