Software & Orchestration
Bare metal alone doesn't train models. The orchestration stack — schedulers, container runtimes, monitoring, fault recovery — is what makes a 30,000-GPU cluster usable. SLURM vs Kubernetes, Run.AI, NVIDIA Base Command, and how gang scheduling actually works.
1 · The orchestration stack
From bare metal up:
- Provisioning — PXE boot, image management (Foreman, Canonical MAAS, Bright Cluster Manager).
- Container runtime — Docker, containerd, NVIDIA Container Toolkit (for GPU passthrough).
- Scheduler — SLURM, Kubernetes, Run.AI, Volcano. Decides what runs where.
- Workload framework — PyTorch + DeepSpeed / Megatron-LM, JAX + XLA, NeMo. The training code itself.
- Observability — DCGM (GPU metrics), Prometheus, Grafana, Slurm/job logs, distributed tracing.
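A concrete taste of the observability layer: dcgm-exporter publishes GPU telemetry (e.g. the `DCGM_FI_DEV_GPU_UTIL` utilization gauge) that Prometheus scrapes, and dashboards query it over Prometheus's HTTP API. A minimal sketch of building such a query for one node — the endpoint address and instance label below are assumptions, not real cluster names:

```python
from urllib.parse import urlencode

# Assumption: Prometheus is reachable at this in-cluster address.
PROMETHEUS = "http://prometheus.internal:9090"

def gpu_util_query_url(node_instance: str) -> str:
    """Build a Prometheus /api/v1/query URL for mean GPU utilization on
    one node, using the DCGM_FI_DEV_GPU_UTIL gauge from dcgm-exporter."""
    promql = f'avg(DCGM_FI_DEV_GPU_UTIL{{instance="{node_instance}"}})'
    return f"{PROMETHEUS}/api/v1/query?" + urlencode({"query": promql})

# Fetch the URL with any HTTP client; the response is JSON.
url = gpu_util_query_url("gpu-node-07:9400")
```

The same pattern covers temperature, ECC error, and NVLink counters — dcgm-exporter exposes each as its own `DCGM_FI_*` metric.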
2 · SLURM vs Kubernetes
SLURM
HPC origin (LLNL, 2002). Native gang scheduling, MPI-aware, low-overhead. Used by every major HPC center and many AI labs (xAI, Meta research clusters). Painful for cloud-native services.
Kubernetes
Cloud-native standard. Needs gang scheduling extensions (Kueue, Volcano) and topology-aware scheduling for AI. Better for multi-tenant + inference; weaker for tightly-coupled training.
3 · Gang scheduling — the hard requirement
A 1,000-GPU training job needs all 1,000 GPUs at once or none of them. There's no value in starting with 999 — every collective operation will block until the missing one shows up.
Gang scheduling guarantees these all-or-nothing semantics. SLURM has it natively. Kubernetes needs Kueue, Volcano, or YuniKorn. Without it you get starvation and deadlock: small jobs keep grabbing free GPUs so a large job never assembles its full set, while large jobs hold partial allocations forever.
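The all-or-nothing rule can be sketched in a few lines (job names and GPU counts below are made up): a job is placed only if its entire request fits; otherwise it waits and holds zero GPUs.

```python
def gang_schedule(queue, free_gpus):
    """All-or-nothing placement: each job gets its full GPU request or
    nothing at all -- no partial allocation is ever parked on the cluster."""
    started, waiting = [], []
    for job, gpus_needed in queue:
        if gpus_needed <= free_gpus:
            free_gpus -= gpus_needed   # whole gang placed atomically
            started.append(job)
        else:
            waiting.append(job)        # holds zero GPUs while it waits
    return started, waiting, free_gpus

# 4,096 free GPUs: pretrain and eval start; ablation waits without
# pinning down a partial 3,096-GPU allocation it can't use.
started, waiting, free = gang_schedule(
    [("pretrain", 1000), ("ablation", 3500), ("eval", 64)], free_gpus=4096)
```

A real scheduler adds priorities, backfill, and topology awareness on top, but the invariant is the same: never hold a partial allocation.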
4 · Fault tolerance at scale
On 100,000 GPUs you have a GPU failure roughly every 30 minutes. Mean-time-between-failure math is brutal at that scale. Modern training stacks handle it three ways:
- Async checkpointing — overlap saves with compute (PyTorch DCP).
- Hot spares — keep 1–2% of nodes idle, ready to swap in.
- Elastic training — frameworks (TorchElastic, NeMo) that resize the world dynamically when nodes drop.
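Two of those ideas fit in a short sketch. First the MTBF arithmetic: failure rates add across components, so cluster MTBF is per-GPU MTBF divided by GPU count (the 50,000-hour per-GPU figure is an illustrative assumption, not a measured value). Second the async-checkpoint pattern: snapshot state synchronously (cheap), then write it to disk in a background thread while training continues — here with `pickle` standing in for PyTorch DCP.

```python
import copy, os, pickle, tempfile, threading

# Failure rates add, so cluster_mtbf = per_gpu_mtbf / n_gpus.
# (50,000 h per GPU is an illustrative assumption.)
per_gpu_mtbf_h, n_gpus = 50_000, 100_000
cluster_mtbf_min = per_gpu_mtbf_h / n_gpus * 60   # one failure every ~30 min

def async_checkpoint(state, path):
    """Snapshot now, write later: training resumes as soon as the
    in-memory copy exists (stand-in for a GPU-to-host transfer)."""
    snapshot = copy.deepcopy(state)
    def _write():
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)   # atomic rename: a crash never leaves a torn file
    t = threading.Thread(target=_write)
    t.start()
    return t                    # join() before starting the next checkpoint

state = {"step": 100, "weights": [0.1, 0.2]}
path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
writer = async_checkpoint(state, path)
state["step"] = 101             # compute overlaps the background write
writer.join()
```

PyTorch DCP exposes the same overlap natively (an `async_save` entry point in recent versions); at scale, most of the engineering goes into making the snapshot step itself fast.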
5 · Vendor stacks
- NVIDIA Base Command Manager (formerly Bright) — provisioning + monitoring + workload management for NVIDIA reference clusters.
- Run.AI (acquired by NVIDIA for ~$700M, 2024) — fractional GPU sharing + dynamic scheduling on Kubernetes.
- Determined AI (acquired by HPE, 2021) — open-source training platform with hyperparameter tuning.
- CoreWeave Mission Control / Lambda Stack — neocloud-built orchestration for their GPU-as-a-service offerings.
Source: SLURM documentation; Kubernetes Kueue project; PyTorch Elastic / DCP docs; NVIDIA Base Command Manager product page; Run.AI / Determined acquisition press releases.
Lesson 07 — TL;DR
- 5-layer stack: provisioning → containers → scheduler → framework → observability.
- SLURM dominates training; Kubernetes dominates inference. Many shops run both.
- Gang scheduling is non-negotiable for distributed training.
- At 100k GPUs, failures happen every ~30 min. Async checkpointing + hot spares are mandatory.
- NVIDIA's Base Command + Run.AI tighten the vertical lock-in.