Practice
Stop reading. Start designing.
Theory is necessary but not sufficient. Below is an interactive AI data center designer that puts the lessons into your hands. Pick scale, hardware, cooling, redundancy — see the numbers a real design team would see. Below that: cloud labs where you can actually rent GPUs and incident scenarios that walk through real decisions.
🛠️ Data Center Designer
Live calculation — no signup, no save state, just play.
Quick presets
⚙️ Configuration
72 B200 GPUs + 36 Grace CPUs in one liquid-cooled rack. Single NVLink domain. Cost shown is per-GPU equivalent.
Cold plates on chips. Required for B200/GB200. Standard for modern AI.
Hyperscaler favorite. Cool climate = great PUE. Mid grid mix.
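What does "per-GPU equivalent" mean in that first preset? Just the rack rolled down to its 72 GPUs. A minimal sketch, with the rack price and power figures assumed for illustration rather than quoted:

```python
# Sketch: per-GPU-equivalent cost for a GB200 NVL72-style rack.
# All prices and power figures are illustrative assumptions, not vendor quotes.

RACK_PRICE_USD = 3_000_000   # assumed all-in price for one NVL72 rack
GPUS_PER_RACK = 72           # 72 B200 GPUs (plus 36 Grace CPUs) per rack
RACK_POWER_KW = 120          # assumed IT load for one liquid-cooled rack

per_gpu_capex = RACK_PRICE_USD / GPUS_PER_RACK
per_gpu_power_w = RACK_POWER_KW * 1000 / GPUS_PER_RACK

print(f"Per-GPU-equivalent capex: ${per_gpu_capex:,.0f}")
print(f"Per-GPU-equivalent IT power: {per_gpu_power_w:.0f} W")
```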
📊 Live results
Capex breakdown
🚂 Training capability (rough estimate)
Assumes 40% MFU. Actual times vary widely with code, communication overhead, restarts.
Calculation notes: PUE math is simplified (cooling baseline × climate factor). Capex includes silicon at street prices. Opex includes power, ~4% maintenance, and a minimum staffing floor of $2M/yr. Training estimates assume 40% MFU. Real designs require working with mech/electrical engineers — this tool teaches relationships, not blueprints.
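If you want to sanity-check the tool offline, the same relationships fit in a short script. Every constant below (cooling baseline, climate factor, power price, GPU peak FLOPs) is an assumption to replace with your own numbers, and "~4% maintenance" is read here as 4% of capex:

```python
# Rough sketch of the designer's roll-up. All constants are assumptions.

def pue(cooling_baseline=0.25, climate_factor=0.8):
    """Simplified PUE: 1 + cooling overhead scaled by a climate factor."""
    return 1.0 + cooling_baseline * climate_factor

def annual_opex_usd(it_power_mw, capex_usd, power_price_per_kwh=0.07,
                    maintenance_rate=0.04, staff_floor_usd=2_000_000):
    """Power at the computed PUE, ~4% of capex in maintenance, staffing floor."""
    energy_kwh = it_power_mw * 1000 * pue() * 8760
    return (energy_kwh * power_price_per_kwh
            + maintenance_rate * capex_usd
            + staff_floor_usd)

def training_days(model_params, tokens, n_gpus,
                  peak_flops_per_gpu=2.25e15, mfu=0.40):
    """~6 * params * tokens total FLOPs, at an assumed dense low-precision peak."""
    total_flops = 6 * model_params * tokens
    seconds = total_flops / (n_gpus * peak_flops_per_gpu * mfu)
    return seconds / 86400

# Example: 100 MW IT load, $3B capex, 405B params on 15T tokens with 16,384 GPUs
print(f"PUE: {pue():.2f}")
print(f"Annual opex: ${annual_opex_usd(100, 3e9)/1e6:.0f}M")
print(f"Training time: {training_days(405e9, 15e12, 16_384):.0f} days")
```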
☁️ Rent real GPUs by the hour
The fastest way to gain real experience: spin up an actual GPU instance for a few dollars, run a workload, observe what happens. The providers below are where that experimentation happens.
RunPod
~$2.00/hr · Affordable on-demand + spot
Probably the cheapest reliable H100 spot pricing for individuals. Community + secure cloud variants.
Vast.ai
~$1.50/hr · Marketplace (cheapest)
Decentralized GPU marketplace. Lowest prices but variable reliability — perfect for experimentation.
Lambda Cloud
~$2.49/hr · Reliable cloud, AI-focused
Built specifically for ML workloads. Pre-installed CUDA stack. Easier than AWS for newcomers.
CoreWeave
Contract pricing · Production neocloud
Real production-scale clusters with InfiniBand fabrics. For when you've outgrown spot instances.
AWS EC2 P5 (H100)
~$13.10/hr · Hyperscaler grade
Full enterprise stack: VPC, IAM, S3 integration. Expensive but production-realistic.
Google Cloud A3
~$11.06/hr · Hyperscaler grade
H100s on GCP. Tight integration with TPU v5/v6 if you want to compare.
Azure ND H100 v5
~$12.30/hr · Hyperscaler grade
Azure's H100 series. Best for Microsoft-stack shops.
NVIDIA NGC Catalog
Free · Container library
Pre-built containers for every major framework. Use these on top of any provider above.
Hugging Face Spaces (ZeroGPU)
Free / Pro $9/mo · Free for demos
Free shared H200 access for inference demos. Great for trying a model before renting.
Prices are approximate spot/on-demand for an H100 80GB. Always check current pricing — GPU markets shift weekly.
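Before picking a provider, price the whole job rather than staring at the hourly rate. A quick sketch using the rates from the table above (they will already be stale by the time you run this):

```python
# Sketch: total cost of a 6-hour, single-H100 experiment per provider.
# Hourly rates copied from the table above; check current pricing first.

hourly_usd = {
    "Vast.ai": 1.50,
    "RunPod": 2.00,
    "Lambda Cloud": 2.49,
    "Google Cloud A3": 11.06,
    "Azure ND H100 v5": 12.30,
    "AWS EC2 P5": 13.10,
}

JOB_HOURS = 6  # assumed length of your experiment

for provider, rate in sorted(hourly_usd.items(), key=lambda kv: kv[1]):
    print(f"{provider:<18} ${rate:>5.2f}/hr -> ${rate * JOB_HOURS:6.2f} total")
```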
🎯 Your first hands-on exercise (under $5)
- Sign up for RunPod or Vast.ai.
- Spin up a single H100 80GB spot instance (~$2-3/hour).
- SSH in. Run nvidia-smi to confirm the GPU is alive.
- Pull the official NVIDIA PyTorch container: docker pull nvcr.io/nvidia/pytorch:25.04-py3
- Run a 30-minute Llama 3.1 8B fine-tune from Hugging Face.
- Watch GPU util and memory in nvtop. Note your tokens/sec (you'll use it in the sketch below).
- Tear it all down. You've spent ~$3 and just done what most people only read about.
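Bonus: turn that tokens/sec number into an MFU estimate and compare it to the 40% the designer assumes. A minimal sketch; the parameter count and H100 peak below are assumptions to adjust for your exact run:

```python
# Sketch: convert observed fine-tuning throughput into an MFU estimate.
# Peak FLOPs and parameter count are assumptions; adjust to your setup.

PARAMS = 8e9                 # Llama 3.1 8B
H100_PEAK_FLOPS = 0.989e15   # assumed BF16 dense peak for one H100 SXM
N_GPUS = 1

def mfu(tokens_per_sec, params=PARAMS, peak=H100_PEAK_FLOPS, n_gpus=N_GPUS):
    """~6 FLOPs per parameter per token for a forward + backward pass."""
    achieved_flops = 6 * params * tokens_per_sec
    return achieved_flops / (peak * n_gpus)

print(f"MFU at 3,000 tok/s: {mfu(3_000):.1%}")  # example observation
```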
🧰 More hands-on tools
🔧
Geographic PUE Calculator
Pick a city + cooling. Get annual PUE, WUE, CUE, and free-cooling hours from real ASHRAE/grid data. 20 markets covered.
Open the calculator →
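Under the hood, the free-cooling number is just "how many hours a year is outside air cold enough?" A toy sketch with synthetic temperatures standing in for the real ASHRAE bin data the calculator uses:

```python
# Toy sketch of free-cooling hours: count hours below an economizer threshold.
# Real tools use ASHRAE bin/TMY data; the synthetic temperatures are placeholders.
import math
import random

random.seed(0)
ECONOMIZER_THRESHOLD_C = 18  # assumed dry-bulb cutoff for free cooling

# Placeholder hourly dry-bulb temps: seasonal sine wave plus noise.
temps = [10 + 12 * math.sin(2 * math.pi * h / 8760) + random.gauss(0, 4)
         for h in range(8760)]

free_hours = sum(t < ECONOMIZER_THRESHOLD_C for t in temps)
print(f"Free-cooling hours: {free_hours} of 8760 ({free_hours / 8760:.0%})")
```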
⚡
Power Budget Game
"Train Llama-405B in 90 days with 200 MW + $2B." Design a cluster that wins all three constraints.
Play →
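The game boils down to three inequalities: budget, power, schedule. A sketch of the checks it runs, with every per-GPU figure an assumption rather than a quoted spec:

```python
# Sketch of the Power Budget Game's three constraints. Per-GPU figures are assumptions.

def check_design(n_gpus, gpu_price=35_000, gpu_power_kw=1.2, pue=1.2,
                 peak_flops=2e15, mfu=0.40, params=405e9, tokens=15e12,
                 budget_usd=2e9, power_cap_mw=200, deadline_days=90):
    capex = n_gpus * gpu_price                      # silicon only; facility is extra
    power_mw = n_gpus * gpu_power_kw * pue / 1000   # facility power incl. cooling
    days = 6 * params * tokens / (n_gpus * peak_flops * mfu) / 86400
    return {
        "within budget": capex <= budget_usd,
        "within power": power_mw <= power_cap_mw,
        "on schedule": days <= deadline_days,
        "capex_$B": round(capex / 1e9, 2),
        "power_MW": round(power_mw, 1),
        "days": round(days),
    }

print(check_design(n_gpus=32_768))
```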
🚨
Incident Scenarios
3 AM page. Coolant leak, grid loss, spine switch failure. Branching runbooks — see consequences of each decision.
Take a page →
🧮
Capacity Planning Worksheet
Power, cooling, space — three independent constraints. Find the binding one. Minimize stranded capacity.
Open the worksheet →
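The worksheet's core move, finding the binding constraint, is a few lines once the numbers are in front of you. The rack-level figures below are made up for illustration:

```python
# Sketch: which of power, cooling, or space binds first? Figures are assumptions.

RACK_KW = 120  # assumed IT load per rack
site = {
    "power":   {"capacity": 30_000, "per_rack": RACK_KW},  # kW of usable IT power
    "cooling": {"capacity": 25_000, "per_rack": RACK_KW},  # kW of heat rejection
    "space":   {"capacity": 400,    "per_rack": 1},        # rack positions
}

racks_by_constraint = {name: c["capacity"] // c["per_rack"] for name, c in site.items()}
binding = min(racks_by_constraint, key=racks_by_constraint.get)
buildable = racks_by_constraint[binding]

print(racks_by_constraint)  # e.g. {'power': 250, 'cooling': 208, 'space': 400}
print(f"Binding constraint: {binding}; buildable racks: {buildable}")
# Capacity above 'buildable' in the other two dimensions is stranded.
```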