gentic.news — AI News Intelligence Platform

Pretraining: definition + examples

Pretraining is the foundational stage in training large neural networks — particularly transformer-based language models — where the model learns general-purpose representations from vast, unlabeled datasets. Unlike supervised learning (which requires labeled examples), pretraining leverages self-supervised objectives such as causal language modeling (predicting the next token), masked language modeling (predicting masked tokens as in BERT), or contrastive learning (as in CLIP). The goal is not to solve a specific task but to capture statistical regularities, syntax, semantics, world knowledge, and reasoning patterns inherent in the training data.
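The causal language-modeling objective mentioned above can be made concrete with a small sketch. The following is a minimal numpy implementation of next-token cross-entropy on toy data; the function name, shapes, and random logits are illustrative, not any particular framework's API:

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Average next-token cross-entropy: the logits at position t
    are scored against the actual token at position t + 1."""
    # logits: (seq_len, vocab_size); token_ids: (seq_len,)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    targets = token_ids[1:]  # shift: predict tokens 1..T-1 from positions 0..T-2
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()

# Toy example: a 4-token sequence over a 5-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
loss = causal_lm_loss(logits, np.array([1, 3, 0, 2]))
```

A useful sanity check: with all-zero logits the model is uniform over the vocabulary, so the loss equals log(vocab_size). Masked language modeling differs only in which positions contribute to the loss.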

Technically, pretraining involves feeding the model billions or trillions of tokens — for example, Llama 3.1 405B was trained on over 15 trillion tokens, while Google’s PaLM 2 used 3.6 trillion tokens. The model parameters (e.g., 7B, 70B, 405B) are updated via backpropagation and optimizers like AdamW, often with a cosine learning rate schedule, gradient clipping, and mixed-precision training (bfloat16). The compute cost is enormous: training GPT-4-level models is estimated to require thousands of GPUs running for weeks or months, with energy costs in the millions of dollars. Techniques like FlashAttention, tensor parallelism, and pipeline parallelism are used to scale across clusters.
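The cosine learning-rate schedule mentioned above is simple to state precisely. This sketch uses linear warmup followed by cosine decay; the specific values (peak rate, warmup length) are illustrative placeholders, not any published model's actual hyperparameters:

```python
import math

def cosine_lr(step, max_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: peak 3e-4, 100 warmup steps, 1000 total steps.
schedule = [cosine_lr(s, 1000, 3e-4, 100) for s in range(1000)]
```

The warmup phase avoids large, destabilizing updates while optimizer statistics (e.g., AdamW's moment estimates) are still noisy; the cosine tail anneals the rate smoothly toward the end of training.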

Why pretraining matters: It produces a *foundation model* that can be adapted to many downstream tasks with minimal additional data (few-shot or fine-tuning). This transfer learning paradigm has driven the success of models like GPT-4, Claude, Gemini, and Llama 3.1. Without pretraining, training a capable model from scratch for every new task would be prohibitively expensive.

When it is used vs. alternatives: Pretraining is the first step for any large-scale foundation model. Alternatives include training a model from scratch solely on labeled data (infeasible for general tasks) or using a smaller, already-pretrained model (which is fine-tuning, not pretraining). For domain-specific use cases (e.g., legal or medical), a common practice is *continued pretraining* on domain corpora before fine-tuning.

Common pitfalls: (1) Data contamination — if the pretraining corpus inadvertently includes test sets from downstream benchmarks, reported performance can be misleading. (2) Catastrophic forgetting during fine-tuning, where the model loses general knowledge if fine-tuned too aggressively. (3) Bias and toxicity from unfiltered web data, requiring careful curation and filtering. (4) Compute inefficiency from suboptimal data mixing ratios (e.g., too much redundant text).
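Pitfall (1) is commonly screened for with n-gram overlap between benchmark items and the pretraining corpus. A minimal sketch of that idea, assuming tokenized inputs (the function names and the 8-gram threshold are illustrative; production decontamination pipelines are more sophisticated):

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_docs, corpus_tokens, n=8):
    """Fraction of benchmark documents sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus_tokens, n)
    flagged = sum(1 for doc in benchmark_docs if ngrams(doc, n) & corpus_grams)
    return flagged / len(benchmark_docs)

# Toy check: one benchmark doc is a substring of the corpus, one is disjoint.
corpus = list(range(20))
rate = contamination_rate([list(range(5, 15)), list(range(100, 110))], corpus)
```

Documents that trip the check are either removed from the corpus or excluded from evaluation, so reported benchmark scores reflect generalization rather than memorization.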

Current state of the art (2026): Pretraining has shifted toward *mixture-of-experts* (MoE) architectures (e.g., Mixtral 8x22B, GPT-4’s rumored MoE), *long-context* pretraining (e.g., Gemini 1.5, which Google reported handling up to 10M tokens in research evaluations), and *multimodal* pretraining (combining text, images, audio, and video). Efficient pretraining methods like *data pruning* (e.g., using the DSIR algorithm), *distillation* (training a smaller model from a larger one’s logits), and *alignment-aware pretraining* (incorporating safety objectives during pretraining) are active research areas. The trend is toward smaller, more data-efficient models (e.g., Microsoft’s Phi-3, trained on “textbook-quality” data) that achieve strong performance with fewer tokens.
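The core idea behind MoE architectures is sparse routing: each token activates only a few "expert" feed-forward networks, so parameter count grows without a proportional increase in per-token compute. A minimal numpy sketch of top-k gating (the dense loop stands in for what real systems implement as sparse, parallelized dispatch; all names here are illustrative):

```python
import numpy as np

def top_k_route(gate_logits, k=2):
    """Pick the top-k experts per token and renormalize their softmax weights."""
    top = np.argsort(gate_logits, axis=-1)[:, -k:]           # (tokens, k) expert ids
    chosen = np.take_along_axis(gate_logits, top, axis=-1)
    w = np.exp(chosen - chosen.max(axis=-1, keepdims=True))  # stable softmax over k
    w /= w.sum(axis=-1, keepdims=True)
    return top, w

def moe_forward(x, experts, gate_w, k=2):
    """Each token is processed only by its k routed experts."""
    ids, weights = top_k_route(x @ gate_w, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += weights[t, j] * experts[ids[t, j]](x[t])
    return out
```

With, say, 8 experts and k=2, roughly a quarter of the expert parameters touch any given token, which is the efficiency MoE models exploit at scale.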

Examples

  • Llama 3.1 405B was pretrained on over 15 trillion tokens using causal language modeling on a mixture of web pages, books, and code.
  • BERT base was pretrained on 3.3 billion tokens from BooksCorpus and English Wikipedia using masked language modeling and next-sentence prediction.
  • CLIP was pretrained on 400 million image-text pairs from the internet using contrastive learning to align visual and textual representations.
  • GPT-4 is reported to use a mixture-of-experts architecture and was pretrained on an undisclosed but massive corpus (estimated >10 trillion tokens).
  • Gemini 1.5 Pro was pretrained on multimodal data (text, images, audio, video), with context lengths reported up to 10M tokens in research evaluations, using a combination of next-token prediction and other self-supervised objectives.
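The masked language modeling used in the BERT example above follows a specific corruption recipe: of the ~15% of positions selected for prediction, 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged. A small sketch of that procedure (the toy vocabulary and helper names are illustrative):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "mat", "dog"]

def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """BERT-style corruption: of the selected positions, 80% -> [MASK],
    10% -> random vocabulary token, 10% left unchanged.
    Returns (corrupted inputs, indices the model must predict)."""
    rng = rng or random.Random(0)
    inputs, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)
            # else: keep the original token unchanged
    return inputs, targets
```

The 10% random-token and 10% keep cases force the model to maintain useful representations for every position, not just those literally showing [MASK], since [MASK] never appears at fine-tuning time.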

Related terms

Fine-Tuning · Self-Supervised Learning · Foundation Model · Transfer Learning · Masked Language Modeling


FAQ

What is Pretraining?

Pretraining is the initial, large-scale unsupervised or self-supervised training phase of a foundation model on a broad, unlabeled corpus to learn general linguistic or multimodal patterns before task-specific fine-tuning.

How does Pretraining work?

The model is fed billions or trillions of tokens, and its parameters are updated via backpropagation using optimizers such as AdamW, typically with a cosine learning rate schedule, gradient clipping, and mixed-precision (bfloat16) training. Self-supervised objectives — causal language modeling, masked language modeling, or contrastive learning — provide the training signal, so no labeled data is required. Techniques like FlashAttention, tensor parallelism, and pipeline parallelism scale the process across GPU clusters.

Where is Pretraining used in 2026?

Llama 3.1 405B was pretrained on over 15 trillion tokens using causal language modeling on a mixture of web pages, books, and code. BERT base was pretrained on 3.3 billion tokens from BooksCorpus and English Wikipedia using masked language modeling and next-sentence prediction. CLIP was pretrained on 400 million image-text pairs from the internet using contrastive learning to align visual and textual representations.