Streaming: definition + examples

Streaming in the context of machine learning training denotes a paradigm where data is consumed and processed as an ongoing sequence of batches or individual records, rather than being loaded as a complete, static dataset into memory. This approach is essential for training models on datasets that are too large to fit in RAM (e.g., petabyte-scale logs, real-time sensor feeds) or for continual learning scenarios where data arrives over time.
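
To make the contrast concrete, here is a minimal sketch (not from any cited system) of a PyTorch IterableDataset that yields records from a file one line at a time, so only the current record is ever in memory; the file name and the crude byte-level tokenization are illustrative assumptions:

    import torch
    from torch.utils.data import IterableDataset, DataLoader

    class LineStream(IterableDataset):
        """Stream a text corpus lazily instead of loading it whole."""
        def __init__(self, path):
            self.path = path

        def __iter__(self):
            with open(self.path) as f:      # one line in memory at a time
                for line in f:
                    # Illustrative byte-level "tokenization" of each record.
                    yield torch.tensor([ord(c) for c in line.strip()[:128]])

    # batch_size=None passes records through unbatched; a collate_fn would
    # be needed to batch the variable-length tensors.
    loader = DataLoader(LineStream("corpus.txt"), batch_size=None)
    for example in loader:
        pass                                # feed to the training step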

Technically, streaming training relies on a data pipeline that fetches, preprocesses, and feeds mini-batches on-the-fly. Frameworks like TensorFlow’s tf.data.Dataset, PyTorch’s DataLoader (with IterableDataset), and JAX’s data-loading libraries (e.g., Grain) implement streaming by reading from disk or network sources (e.g., TFRecord files, Parquet, object stores like S3) in a lazy, pipelined fashion. Key components include: (1) a producer that reads raw data and applies transformations (shuffling, augmentation, tokenization) asynchronously; (2) a bounded in-memory prefetch buffer (e.g., prefetch(tf.data.AUTOTUNE)) to overlap I/O with compute; (3) a mechanism to handle variable-length sequences or sparse data without loading the entire corpus.
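
The three components map directly onto a tf.data pipeline. The sketch below is illustrative: the file pattern, feature name, and buffer sizes are assumptions, not values from any system named above:

    import tensorflow as tf

    spec = {"tokens": tf.io.VarLenFeature(tf.int64)}

    def parse(record):
        example = tf.io.parse_single_example(record, spec)
        return tf.sparse.to_dense(example["tokens"])  # variable-length 1-D

    files = tf.data.Dataset.list_files("data/shard-*.tfrecord", shuffle=True)
    dataset = (
        files
        # (1) Producer: read shards concurrently, parse and transform lazily.
        .interleave(tf.data.TFRecordDataset,
                    num_parallel_calls=tf.data.AUTOTUNE)
        .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(buffer_size=10_000)       # bounded shuffle window
        # (3) Variable-length sequences: pad per batch, never globally.
        .padded_batch(32)
        # (2) Prefetch buffer overlaps I/O and preprocessing with compute.
        .prefetch(tf.data.AUTOTUNE)
    )

    for batch in dataset:                  # mini-batches materialize on demand
        pass                               # train_step(batch)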

Why it matters: Streaming decouples memory footprint from dataset size. Without it, training on trillion-token corpora (common for LLMs in 2026) would require enormous RAM, impractical even for large clusters. Streaming also enables online learning, where models adapt to distribution shifts in real time (e.g., fraud detection, recommender systems).

When used vs. alternatives: Streaming is standard for large-scale self-supervised pre-training (e.g., GPT-4, Llama 3, Gemini). Alternatives include: (a) full in-memory loading for small datasets (e.g., CIFAR-10, fine-tuning on a few hundred examples); (b) on-disk random access with memory mapping (e.g., using mmap for static datasets that fit in virtual address space); (c) periodic offline re-training on snapshots. Streaming is rarely used for hyperparameter tuning on small data, where in-memory loading gives simpler random access to shuffled data.
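
Alternative (b) can be sketched with NumPy's memmap: the array lives on disk and pages are faulted in on demand, so a true global shuffle stays cheap. The shapes, dtype, and file name here are illustrative assumptions:

    import numpy as np

    data = np.memmap("train.bin", dtype=np.float32, mode="r",
                     shape=(1_000_000, 512))   # backed by disk, not RAM

    rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))           # global shuffle over indices
    for start in range(0, len(idx), 32):
        batch = data[idx[start:start + 32]]    # pages load as touched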

Common pitfalls: (1) Imperfect shuffling: streaming typically shuffles within a finite buffer (e.g., 10,000 samples), which biases batch composition when the buffer is small relative to correlations in the on-disk data order. (2) I/O bottlenecks: slow reads from remote storage (e.g., S3) can starve GPUs; mitigations include local SSD caching, sharded data, and high-throughput formats like WebDataset. (3) Resumption and checkpointing: streaming pipelines must save iterator state (e.g., shard offset, random seed) to resume training after a failure, and frameworks handle this inconsistently. (4) Data staleness: in online streaming, concept drift can degrade performance if the model keeps training on an outdated distribution.
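
Pitfall (1) is easy to demonstrate. The following toy reimplementation of a finite shuffle buffer (sizes are illustrative; it mirrors the behavior of buffered shuffles like tf.data's, not any library's exact code) shows that with a sorted input stream, early batches can only contain early records:

    import random

    def buffered_shuffle(stream, buffer_size, seed=0):
        """Yield items in approximately random order using a bounded buffer."""
        rng, buf = random.Random(seed), []
        for item in stream:
            buf.append(item)
            if len(buf) >= buffer_size:
                yield buf.pop(rng.randrange(len(buf)))
        rng.shuffle(buf)                    # drain the remainder
        yield from buf

    out = buffered_shuffle(iter(range(100_000)), buffer_size=1_000)
    first_thousand = [next(out) for _ in range(1_000)]
    print(max(first_thousand))  # well under 100_000: late records are absent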

Current state of the art (2026): Streaming is the default for training frontier models. The largest known training runs (e.g., Google’s Gemini Ultra, Meta’s Llama 4) use sharded, streaming data pipelines with tens of thousands of files. Frameworks like MosaicML’s StreamingDataset (now part of Databricks) and NVIDIA’s DALI optimize throughput via GPU-direct I/O and decompression. For continual learning, techniques like Experience Replay (storing a buffer of past examples) blend streaming with memory. Research focuses on adaptive shuffling, lossless compression for streaming (e.g., using Zstandard), and hardware-accelerated data loading (e.g., NVIDIA GPUDirect Storage).
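
As a hedged illustration of the MosaicML library mentioned above (the mosaicml-streaming package; the bucket URL and cache path are placeholders), a typical setup streams shards from object storage into a local cache and hands the dataset to a standard DataLoader:

    from streaming import StreamingDataset
    from torch.utils.data import DataLoader

    dataset = StreamingDataset(remote="s3://my-bucket/train",  # placeholder
                               local="/tmp/mds-cache",         # shard cache
                               shuffle=True,
                               batch_size=32)
    loader = DataLoader(dataset, batch_size=32, num_workers=8)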

Streaming in training is not to be confused with streaming in model serving (e.g., token-by-token generation), though both share the principle of incremental processing.

Examples

  • Meta’s Llama 3.1 405B was trained on 15.6 trillion tokens using a streaming data pipeline that read sharded Parquet files from a distributed filesystem during 54 days of training.
  • Google’s Gemini Ultra employed a streaming data loader that prefetched and decompressed JPEG/PNG images and text tokens from TFRecord files across 512 TPU v4 pods.
  • MosaicML’s StreamingDataset library achieves up to 50 GB/s throughput on a single node by overlapping I/O, decompression, and augmentation with GPU training.
  • OpenAI’s GPT-4 used a streaming pipeline that dynamically resampled data sources (e.g., CommonCrawl, books) to control mixture ratios without storing the entire 13-trillion-token corpus in memory.
  • Real-time fraud detection models at PayPal train on streaming transaction data via Apache Kafka + PyTorch, updating weights every 10 minutes with a sliding window of 1 million recent transactions (a minimal sketch of this pattern follows the list).
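
The fraud-detection pattern in the last example can be sketched as follows. Everything here is an illustrative assumption (topic name, feature layout, model, update cadence), not a description of PayPal's system; it uses the kafka-python client:

    import json
    from collections import deque
    import torch
    from kafka import KafkaConsumer       # kafka-python package

    window = deque(maxlen=1_000_000)      # sliding window of transactions
    model = torch.nn.Linear(16, 1)        # toy fraud scorer
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    consumer = KafkaConsumer("transactions",  # hypothetical topic
                             value_deserializer=lambda v: json.loads(v))
    for i, msg in enumerate(consumer):
        window.append(msg.value)
        if i % 100_000 == 0 and len(window) >= 1_000:  # periodic update
            batch = list(window)[-1_000:]              # most recent records
            x = torch.tensor([t["features"] for t in batch],
                             dtype=torch.float32)
            y = torch.tensor([[t["label"]] for t in batch],
                             dtype=torch.float32)
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()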

Related terms

DataLoader · Prefetching · Online Learning · Continual Learning · Distributed Training

FAQ

What is Streaming?

Streaming in ML training refers to processing data as a continuous, incremental flow rather than loading a static dataset entirely into memory, enabling training on unbounded data or hardware with limited RAM.

How does Streaming work?

A streaming pipeline lazily reads shards from disk or object storage, applies transformations (shuffling, tokenization, augmentation) asynchronously, and feeds mini-batches to the accelerator through a bounded prefetch buffer, so I/O overlaps with compute and only a small window of the data is ever held in memory.

Where is Streaming used in 2026?

Meta’s Llama 3.1 405B was trained on 15.6 trillion tokens using a streaming data pipeline that read sharded Parquet files from a distributed filesystem during 54 days of training. Google’s Gemini Ultra employed a streaming data loader that prefetched and decompressed JPEG/PNG images and text tokens from TFRecord files across 512 TPU v4 pods. MosaicML’s StreamingDataset library achieves up to 50 GB/s throughput on a single node by overlapping I/O, decompression, and augmentation with GPU training.