Storage Architecture
Foundation model training pipelines push checkpoints, datasets, and intermediate state at aggregate rates of 100 GB/s or more. This lesson covers parallel filesystems (Lustre, WekaFS, VAST), NVMe-over-fabrics, checkpoint strategies, and where object storage fits.
1 · The four-tier storage hierarchy
An AI cluster moves data through four tiers, each at very different cost and performance:
- GPU memory (HBM) — TB/s per GPU, tens of GB; holds the working set during a step.
- Node-local NVMe — GB/s per drive, TBs per node; scratch space and local caches.
- Parallel filesystem — TB/s aggregate across the cluster; the shared hot tier for datasets and checkpoints.
- Object storage — cheapest per TB, highest latency; the cold home for raw datasets and trained models.
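To make the gap between tiers concrete, here is a back-of-the-envelope comparison. The bandwidth figures are illustrative order-of-magnitude assumptions, not vendor specs:

```python
# Order-of-magnitude bandwidths per tier (illustrative assumptions, not specs).
TIER_BW_GBPS = {
    "HBM (per GPU)": 3000,           # modern HBM is measured in TB/s
    "local NVMe (per drive)": 7,
    "parallel FS (per client)": 40,
    "object storage (per stream)": 1,
}

def read_seconds(size_gb: float, bw_gbps: float) -> float:
    """Time to stream `size_gb` gigabytes at `bw_gbps` GB/s."""
    return size_gb / bw_gbps

# Time to stream a 2,000 GB checkpoint from each tier:
for tier, bw in TIER_BW_GBPS.items():
    print(f"{tier:30s} {read_seconds(2000, bw):10.1f} s")
```

The spread is roughly three orders of magnitude end to end, which is why data placement, not raw capacity, dominates storage design for training clusters.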
2 · Parallel filesystems — the heart of training
Parallel filesystems split files across many storage servers so a single client read can saturate multiple drives. The leaders today:
Lustre
Open-source HPC veteran. Battle-tested on supercomputers (Frontier, El Capitan). Complex to operate; OST/MDS architecture. Used by Meta for some training storage.
IBM Storage Scale (formerly GPFS / Spectrum Scale)
Mature, integrated with object/file/NFS gateways. Used by national labs and many enterprise AI deployments.
WekaFS
Software-defined, NVMe-only, POSIX + S3 + NFS. Strong AI mindshare — deployed at Stability AI, Cohere, others. Linear scaling claims.
VAST Data
DASE (Disaggregated Shared Everything) architecture. Single global namespace, all-flash, used by CoreWeave and others.
DDN Lustre / Exascaler / Infinia
DDN is the dominant supplier of HPC-grade Lustre boxes; their newer Infinia product targets AI workloads with object semantics.
3 · Checkpoints — the hidden bandwidth tax
Training a 1T-parameter model means saving model state every few hours (or even more often) so you don't lose work to a failed GPU. Each checkpoint is at least the size of the model weights — for a 1T BF16 model, that's ~2 TB (optimizer state adds more on top, so treat ~2 TB as a floor).
On a 100k-GPU cluster checkpointing every 30 minutes, your storage system must absorb a ~2 TB burst every half hour without stalling training. Modern checkpoint frameworks use asynchronous distributed checkpointing (PyTorch DCP, NVIDIA NeMo) to overlap saves with compute.
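The arithmetic behind that claim is worth doing once. A sketch using the sizes and interval quoted above:

```python
def checkpoint_bandwidth(params_e12: float, bytes_per_param: int,
                         interval_s: float) -> tuple[float, float]:
    """Return (checkpoint size in TB, sustained write bandwidth in GB/s)
    needed to absorb one checkpoint per interval."""
    size_tb = params_e12 * bytes_per_param        # 1e12 params * B/param = TB
    sustained_gbps = size_tb * 1000 / interval_s  # TB -> GB, spread over the interval
    return size_tb, sustained_gbps

# 1T parameters, BF16 (2 bytes/param), checkpoint every 30 minutes:
size, bw = checkpoint_bandwidth(1.0, 2, 30 * 60)
print(f"~{size:.0f} TB per checkpoint, ~{bw:.1f} GB/s sustained")
```

The sustained rate is modest (~1 GB/s); the problem is the burst. A synchronous save blocks training for size ÷ filesystem-bandwidth seconds, which is exactly what asynchronous distributed checkpointing amortizes away.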
4 · NVMe-over-fabrics
NVMe-oF lets a server access remote NVMe drives over the network as if they were local. Two flavors dominate:
- NVMe/RDMA (over RoCE or InfiniBand) — single-digit-microsecond added latency, hardware-offloaded.
- NVMe/TCP — slightly higher latency but works over standard Ethernet.
NVMe-oF is what enables disaggregated storage: instead of every server having local SSDs, you put all the SSDs in dedicated storage nodes and serve them over the fabric.
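A rough latency budget shows why NVMe/TCP is often "good enough" for throughput-bound AI I/O. All numbers below are illustrative assumptions, not measurements:

```python
# Illustrative latency components in microseconds (assumptions, not measurements).
FLASH_READ_US = 80          # typical NVMe flash read latency
FABRIC_OVERHEAD_US = {
    "local NVMe": 0,
    "NVMe/RDMA": 5,         # hardware-offloaded over RoCE / InfiniBand
    "NVMe/TCP": 30,         # kernel TCP stack adds more per round trip
}

for transport, overhead in FABRIC_OVERHEAD_US.items():
    total = FLASH_READ_US + overhead
    pct = 100 * overhead / total
    print(f"{transport:12s} {total:4d} us total ({pct:4.1f}% fabric overhead)")
```

Because flash itself dominates the budget, the fabric transport matters less for bulk dataset reads than for latency-sensitive metadata operations.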
5 · Object storage
S3-compatible object storage (Amazon S3, MinIO, Pure FlashBlade, Cloudian, etc.) is where datasets and trained models live. It's throughput-oriented, high-latency per request, and cheap per TB — perfect for billions of small training samples but too slow for the active training loop.
Typical pipeline: S3 (cold) → parallel FS (hot) → GPU memory. Caches like Alluxio sit in between to mask the latency.
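The hot/cold split above is essentially a read-through cache. A minimal stdlib-only sketch — the class, tier names, and API here are hypothetical stand-ins, not any real cache's interface:

```python
class TieredReader:
    """Read-through cache: serve from the hot tier when possible, fall back
    to the cold tier on a miss and promote the object (hypothetical sketch)."""

    def __init__(self, cold: dict):
        self.cold = cold      # stands in for S3 / object storage
        self.hot = {}         # stands in for the parallel FS cache
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self.hot:          # hot hit: served at parallel-FS speed
            return self.hot[key]
        self.misses += 1
        data = self.cold[key]        # cold miss: fetch from the object store
        self.hot[key] = data         # promote so the next epoch reads hot
        return data

s3 = {"shard-0001": b"training samples..."}
reader = TieredReader(s3)
reader.read("shard-0001")   # first epoch: cold miss, object promoted
reader.read("shard-0001")   # second epoch: served from the hot tier
print(reader.misses)        # 1
```

Systems like Alluxio implement this pattern at cluster scale, with eviction and prefetching layered on top.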
Source: WekaFS architecture papers; VAST Data DASE white paper; Lustre documentation; PyTorch Distributed Checkpoint (DCP) docs.
Lesson 06 — TL;DR
- 4 tiers: HBM → NVMe (per node) → parallel FS → object storage.
- Parallel filesystems (WekaFS, VAST, Lustre, IBM Storage Scale, DDN) handle hot data at TB/s.
- Checkpoints are the hidden bandwidth tax — async distributed checkpointing solves it.
- NVMe-oF enables disaggregated storage — fewer drives, higher utilization.
- Object storage holds the cold dataset and model registry.