Lesson 07/12 · Advanced · 14 min read · 3 diagrams

Software & Orchestration

Bare metal alone doesn't train models. The orchestration stack — schedulers, container runtimes, monitoring, fault recovery — is what makes a 30,000-GPU cluster usable. SLURM vs Kubernetes, Run.AI, NVIDIA Base Command, and how gang scheduling actually works.

1 · The orchestration stack

From bare metal up:

  1. Provisioning — PXE boot, image management (Foreman, MaaS, Bright Cluster Manager).
  2. Container runtime — Docker, containerd, NVIDIA Container Toolkit (for GPU passthrough).
  3. Scheduler — SLURM, Kubernetes, Run.AI, Volcano. Decides what runs where.
  4. Workload framework — PyTorch + DeepSpeed / Megatron-LM, JAX + XLA, NeMo. The training code itself.
  5. Observability — DCGM (GPU metrics), Prometheus, Grafana, Slurm/job logs, distributed tracing.

2 · SLURM vs Kubernetes

SLURM

HPC origin (LLNL, 2002). Native gang scheduling, MPI-aware, low-overhead. Used by every major HPC center and many AI labs (xAI, Meta research clusters). Painful for cloud-native services.
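For flavor, a gang-scheduled SLURM submission looks roughly like this batch-script fragment (job name, node count, and the training entry point are illustrative placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=128             # all 128 nodes are allocated together, or the job waits
#SBATCH --ntasks-per-node=8     # one rank per GPU
#SBATCH --gres=gpu:8            # 8 GPUs on every node
#SBATCH --time=48:00:00

srun python train.py            # srun launches all 1,024 ranks as one gang
```

The all-or-nothing allocation is the default: SLURM never hands a job a subset of the nodes it asked for.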

Kubernetes

Cloud-native standard. Needs gang scheduling extensions (Kueue, Volcano) and topology-aware scheduling for AI. Better for multi-tenant + inference; weaker for tightly-coupled training.
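To get equivalent behavior on Kubernetes with Volcano, the gang is expressed as a PodGroup with a minimum member count (resource names here are illustrative):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-pretrain
spec:
  minMember: 1000   # admit pods only when all 1,000 can be placed at once
```

Pods reference the group, and the Volcano scheduler holds them all in a pending state until the full gang fits.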

3 · Gang scheduling — the hard requirement

A 1,000-GPU training job needs all 1,000 GPUs at once or none of them. There's no value in starting with 999 — every collective operation will block until the missing one shows up.

Gang scheduling guarantees this all-or-nothing behavior. SLURM has it natively; Kubernetes needs Kueue, Volcano, or YuniKorn. Without it, the cluster deadlocks by fragmentation: large jobs hold partial allocations forever while small jobs leapfrog them and consume the remaining capacity, so the big job never assembles its full gang.
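The all-or-nothing rule can be sketched in a few lines of Python (a toy model, not any scheduler's actual algorithm):

```python
from typing import Dict, List

def gang_schedule(free_gpus: int, queue: List[Dict]) -> List[str]:
    """All-or-nothing placement: a job is admitted only if its full
    GPU request fits; otherwise it holds nothing and keeps waiting."""
    admitted = []
    for job in queue:
        if job["gpus"] <= free_gpus:
            free_gpus -= job["gpus"]      # the whole allocation, atomically
            admitted.append(job["name"])
        # else: the job waits — it never holds a partial allocation
    return admitted

# 1,024 free GPUs: the 1,000-GPU job fits; the 512-GPU job must wait.
jobs = [{"name": "llm-pretrain", "gpus": 1000},
        {"name": "ablation", "gpus": 512}]
print(gang_schedule(1024, jobs))  # → ['llm-pretrain']
```

Note the `else` branch is where non-gang schedulers go wrong: they would hand the job whatever GPUs are free and let it block on collectives.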

4 · Fault tolerance at scale

On 100,000 GPUs you have a GPU failure roughly every 30 minutes. Mean-time-between-failure math is brutal at that scale. Modern training stacks handle it three ways:

  • Async checkpointing — overlap saves with compute (PyTorch DCP).
  • Hot spares — keep 1–2% of nodes idle, ready to swap in.
  • Elastic training — frameworks (TorchElastic, NeMo) that resize the world dynamically when nodes drop.
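The MTBF math behind the "every 30 minutes" figure is simple division — failures across GPUs are roughly independent, so cluster MTBF shrinks as 1/N. The per-GPU MTBF below is an illustrative assumption chosen to match the headline number, not a vendor spec:

```python
# Back-of-envelope cluster reliability.
per_gpu_mtbf_h = 50_000      # assumed ~5.7 years per GPU (illustrative)
n_gpus = 100_000

# Independent failures: cluster MTBF = per-device MTBF / device count.
cluster_mtbf_min = per_gpu_mtbf_h / n_gpus * 60
print(f"{cluster_mtbf_min:.0f} min between failures")  # → 30 min between failures
```

This is why checkpoint interval and restart time dominate effective throughput at frontier scale: if recovery costs 10 minutes, a 30-minute MTBF caps utilization well below 100%.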

5 · Vendor stacks

  • NVIDIA Base Command Manager (formerly Bright) — provisioning + monitoring + workload management for NVIDIA reference clusters.
  • Run.AI (acquired by NVIDIA for a reported ~$700M, 2024) — fractional GPU sharing + dynamic scheduling on K8s.
  • Determined AI (acquired by HPE, 2021) — open-source training platform with hyperparameter tuning.
  • CoreWeave Mission Control / Lambda Stack — neocloud-built orchestration for their GPU-as-a-service offerings.

Source: SLURM documentation; Kubernetes Kueue project; PyTorch Elastic / DCP docs; NVIDIA Base Command Manager product page; Run.AI / Determined acquisition press releases.

Lesson 07 — TL;DR

  • 5-layer stack: provisioning → containers → scheduler → framework → observability.
  • SLURM dominates training; Kubernetes dominates inference. Many shops run both.
  • Gang scheduling is non-negotiable for distributed training.
  • At 100k GPUs, failures happen every ~30 min. Async checkpointing + hot spares are mandatory.
  • NVIDIA's Base Command + Run.AI tighten the vertical lock-in.

Useful? Share so the next engineer learns this faster.
