Lesson 06/12 · Intermediate · 13 min read · 3 diagrams

Storage Architecture

Foundation model training pipelines move checkpoints, datasets, and intermediate state at aggregate rates on the order of 100 GB/s and beyond. This lesson covers parallel filesystems (Lustre, WekaFS, VAST), NVMe-over-fabrics, checkpoint strategies, and where object storage fits.

1 · The four-tier storage hierarchy

An AI cluster moves data through four tiers, each at very different cost and performance:

| Tier | Bandwidth | Holds |
|---|---|---|
| HBM (on-chip GPU memory) | ~10 TB/s per GPU | Active model weights (not really storage; it's the working set) |
| NVMe (per node) | ~14 GB/s per drive | Hot dataset, scratch |
| Parallel FS | ~1–100 TB/s aggregate | Datasets + checkpoints |
| Object storage | ~1–10 GB/s per user | Cold archive, model registry |

2 · Parallel filesystems — the heart of training

Parallel filesystems split files across many storage servers so a single client read can saturate multiple drives. The leaders today:

Lustre

Open-source HPC veteran. Battle-tested on supercomputers (Frontier, El Capitan). Complex to operate; separates metadata servers (MDS) from object storage targets (OST). Used by Meta for some training storage.

IBM Storage Scale (formerly GPFS / Spectrum Scale)

Mature, integrated with object/file/NFS gateways. Used by national labs and many enterprise AI deployments.

WekaFS

Software-defined, NVMe-only, POSIX + S3 + NFS. Strong AI mindshare — deployed at Stability AI, Cohere, others. Linear scaling claims.

VAST Data

DASE (Disaggregated Shared Everything) architecture. Single global namespace, all-flash, used by CoreWeave and others.

DDN Lustre / Exascaler / Infinia

DDN is the dominant supplier of HPC-grade Lustre boxes; their newer Infinia product targets AI workloads with object semantics.
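All of these systems share the same core trick: a file is striped round-robin across storage targets so a single client read pulls from many drives in parallel. A toy sketch of the offset math (the function name and parameters are illustrative, not any particular filesystem's API):

```python
def locate_stripe(offset: int, stripe_size: int, num_targets: int) -> tuple[int, int]:
    """Map a byte offset in a striped file to (target index, offset within
    that target's backing object), assuming simple round-robin striping."""
    stripe_index = offset // stripe_size           # which stripe the byte falls in
    target = stripe_index % num_targets            # stripes rotate across targets
    local = (stripe_index // num_targets) * stripe_size + offset % stripe_size
    return target, local

# With 1 MiB stripes over 8 targets, a sequential read touches all 8 drives:
MiB = 1 << 20
print(locate_stripe(0, MiB, 8))        # (0, 0): first stripe on target 0
print(locate_stripe(9 * MiB, MiB, 8))  # stripe 9 wraps around to target 1
```

This is exactly why stripe count and stripe size are the first tuning knobs in Lustre-style systems: a file striped over one target reads at single-drive speed no matter how big the cluster is.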

3 · Checkpoints — the hidden bandwidth tax

Training a 1T-parameter model means saving model state every few hours (or even more often) so you don't lose work to a failed GPU. Each checkpoint is at least the size of the model weights — for a 1T BF16 model, that's ~2 TB — and a full checkpoint that also carries optimizer state (e.g. Adam's two moments) can be several times larger.

On a 100k-GPU cluster checkpointing every 30 minutes, your storage system must absorb each ~2 TB burst without stalling training. Amortized over the interval that is only ~1 GB/s, but a synchronous save must finish in seconds rather than minutes, so the required burst bandwidth is far higher. Modern checkpoint frameworks use asynchronous distributed checkpointing (PyTorch DCP, NVIDIA NeMo) to overlap saves with compute.
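The pattern behind asynchronous checkpointing can be shown without any framework: take a cheap in-memory snapshot of the state, then persist it on a background thread while training continues. This is a toy single-process sketch, not PyTorch DCP's actual API; real frameworks first copy GPU tensors to host memory and coordinate the write across thousands of ranks.

```python
import os
import pickle
import tempfile
import threading

def _write_snapshot(snapshot: dict, path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)

def async_checkpoint(state: dict, path: str) -> threading.Thread:
    """Snapshot state, then persist it in the background.

    The shallow copy stands in for the device-to-host copy a real
    framework performs; the caller may rebind entries in `state`
    immediately after this returns without corrupting the checkpoint.
    """
    snapshot = dict(state)                 # fast in-memory snapshot
    writer = threading.Thread(target=_write_snapshot, args=(snapshot, path))
    writer.start()
    return writer                          # join() before starting the next save

# Usage: training continues while the write is in flight.
path = os.path.join(tempfile.gettempdir(), "ckpt.pkl")
state = {"step": 100, "weights": [0.1, 0.2]}
writer = async_checkpoint(state, path)
state["step"] = 101                        # safe: the snapshot was already taken
writer.join()
```

The key property is that the training loop only pays for the in-memory snapshot; the slow write to the parallel filesystem overlaps with the next training steps.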

4 · NVMe-over-fabrics

NVMe-oF lets a server access remote NVMe drives over the network as if they were local. Two flavors dominate:

  • NVMe/RDMA (over RoCE or InfiniBand) — adds only a few microseconds over local NVMe access, hardware-offloaded.
  • NVMe/TCP — slightly higher latency but works over standard Ethernet.

NVMe-oF is what enables disaggregated storage: instead of every server having local SSDs, you put all the SSDs in dedicated storage nodes and serve them over the fabric.
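The utilization argument is easy to quantify. With illustrative numbers (assumptions, not figures from any real deployment): suppose each node bursts to 20 GB/s but averages 2 GB/s, and one drive delivers the ~14 GB/s from the tier table.

```python
import math

NODES = 1000
PEAK_PER_NODE_GBS = 20    # assumed burst demand per node
AVG_PER_NODE_GBS = 2      # assumed steady-state demand per node
DRIVE_GBS = 14            # one NVMe drive, per the tier table
HEADROOM = 1.5            # assumed sizing margin over the aggregate average

# Local SSDs: every node must be provisioned for its own peak.
local_drives = NODES * math.ceil(PEAK_PER_NODE_GBS / DRIVE_GBS)

# Disaggregated: the shared pool is sized for the aggregate average
# (plus headroom), because node peaks rarely coincide.
pooled_drives = math.ceil(NODES * AVG_PER_NODE_GBS * HEADROOM / DRIVE_GBS)

print(local_drives, pooled_drives)   # 2000 vs 215
```

Under these (made-up but plausible) numbers the pooled design needs roughly a tenth of the drives, which is the whole economic case for disaggregation.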

5 · Object storage

S3-compatible object storage (Amazon S3, MinIO, Pure FlashBlade, Cloudian, etc.) is where datasets and trained models live. It's throughput-oriented, high-latency per request, and cheap — perfect for billions of small training samples but too slow for the active training loop.

Typical pipeline: S3 (cold) → parallel FS (hot) → GPU memory. Caches like Alluxio sit in between to mask the latency.
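A cache like Alluxio essentially implements read-through staging: check the hot tier, and on a miss pull the object from the cold tier before returning it. A minimal sketch, with two local directories standing in for object storage and the parallel filesystem (the function name and layout are illustrative):

```python
import shutil
import tempfile
from pathlib import Path

def fetch_sample(sample_name: str, cold_dir: Path, hot_dir: Path) -> Path:
    """Return the hot-tier path for a sample, staging it from the
    cold tier on a cache miss (read-through caching)."""
    hot_path = hot_dir / sample_name
    if not hot_path.exists():                       # miss: stage from "S3"
        shutil.copyfile(cold_dir / sample_name, hot_path)
    return hot_path

# Demo with temp directories standing in for the two tiers:
cold = Path(tempfile.mkdtemp())
hot = Path(tempfile.mkdtemp())
(cold / "sample-0001.bin").write_bytes(b"pixels")

p = fetch_sample("sample-0001.bin", cold, hot)      # first access: staged
p = fetch_sample("sample-0001.bin", cold, hot)      # second access: cache hit
```

Real caches add eviction, sharding, and consistency checks; the point is only that after the first access, the training loop sees hot-tier latency instead of object-store latency.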

Source: WekaFS architecture papers; VAST Data DASE white paper; Lustre documentation; PyTorch Distributed Checkpoint (DCP) docs.

Lesson 06 — TL;DR

  • 4 tiers: HBM → NVMe (per node) → parallel FS → object storage.
  • Parallel filesystems (WekaFS, VAST, Lustre, IBM Storage Scale, DDN) handle hot data at TB/s.
  • Checkpoints are the hidden bandwidth tax — async distributed checkpointing solves it.
  • NVMe-oF enables disaggregated storage — fewer drives, higher utilization.
  • Object storage holds the cold dataset and model registry.
