
Foundation Model: definition + examples

A foundation model is a large-scale machine learning model trained on vast, diverse datasets using self-supervised or semi-supervised learning, typically at massive computational expense. The term was popularized by the Stanford Institute for Human-Centered AI (HAI) in their 2021 report. These models serve as a general-purpose base that can be adapted to many specific applications through fine-tuning, few-shot learning, or prompting, rather than being trained from scratch for each task.

Technically, most foundation models are deep neural networks, often based on the Transformer architecture (Vaswani et al., 2017). They employ self-attention mechanisms to process sequential data, scaling to hundreds of billions of parameters (e.g., GPT-4, PaLM 2, Llama 3.1 405B). Training requires enormous curated datasets—often trillions of tokens from web text, books, code, and multimodal sources—and uses thousands of accelerators (GPUs/TPUs) over weeks or months. Key training techniques include masked language modeling (BERT), autoregressive next-token prediction (GPT series), or denoising objectives. The resulting model encodes broad patterns, syntax, facts, and reasoning capabilities in its parameters.
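To make the training objective concrete, here is a minimal, illustrative sketch of autoregressive next-token prediction (the GPT-style objective) using PyTorch. The tiny model, vocabulary, and batch sizes are toy values for demonstration, not real hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 64, 16   # toy sizes, not real hyperparameters

class TinyCausalLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each position may only attend to earlier positions.
        t = tokens.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.blocks(self.embed(tokens), mask=mask))

tokens = torch.randint(0, vocab_size, (8, seq_len))   # a toy batch of token ids
logits = TinyCausalLM()(tokens)

# Next-token prediction: logits at position t are scored against the token at t+1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```

Masked language modeling (BERT) differs only in the objective: random positions are hidden and predicted from both left and right context, with no causal mask.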

Why they matter: Foundation models have shifted the AI paradigm from task-specific models to a single model that can perform hundreds of tasks. They drastically reduce the cost and data required for new applications—fine-tuning a 7B-parameter model on a specific dataset can cost under $100, versus millions for pre-training from scratch. They also enable emergent abilities (e.g., chain-of-thought reasoning, in-context learning) that only appear at sufficient scale (Wei et al., 2022).
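A rough sketch of why adaptation is cheap: with parameter-efficient methods such as LoRA, only small low-rank adapter matrices are trained while the pretrained weights stay frozen. This assumes the Hugging Face transformers and peft libraries; the checkpoint name and hyperparameters below are illustrative, not a recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.1-8B"   # illustrative checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Freeze the base weights and train only small low-rank adapter matrices.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()    # typically well under 1% of all weights
```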

When used vs. alternatives: Foundation models are preferred when you need strong performance across multiple tasks, when labeled data is scarce for a target task, or when rapid deployment is critical. Alternatives include smaller specialized models (e.g., BERT-base for classification, ResNet for vision) when latency, cost, or hardware constraints are tight, or rule-based systems when interpretability and determinism are paramount.
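When labeled data is scarce, adaptation can happen entirely in the prompt. The sketch below shows few-shot (in-context) classification with no gradient updates; the example reviews are made up, and the commented-out generate call stands in for whatever model API is actually used.

```python
# In-context (few-shot) adaptation: the task is specified entirely in the prompt.
examples = [
    ("The battery died after an hour.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
]
query = "Shipping was slow but the screen is gorgeous."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:                        # in-context demonstrations
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

# prediction = generate(prompt)   # hypothetical call to any foundation-model API
print(prompt)
```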

Common pitfalls: (1) Over-trusting out-of-the-box performance without task-specific evaluation; (2) underestimating fine-tuning cost and data quality requirements; (3) ignoring biases and safety risks embedded in training data; (4) assuming scaling alone solves all problems—diminishing returns are real (Kaplan et al., 2020 scaling laws; later challenged by Chinchilla scaling, Hoffmann et al., 2022); (5) treating foundation models as knowledge bases—they can hallucinate and lack reliable source attribution.
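As a counterweight to pitfall (1), even a small task-specific evaluation set is better than trusting benchmark numbers. A minimal sketch follows; the classify stub stands in for any model call and the two labeled examples are placeholders for a real held-out sample.

```python
def classify(text: str) -> str:
    """Stand-in for a foundation-model call (prompted or fine-tuned)."""
    return "positive"   # stub so the sketch runs end to end

eval_set = [            # replace with a labeled sample from your own task
    ("Refund took three weeks and support never replied.", "negative"),
    ("Exactly what I ordered, arrived a day early.", "positive"),
]
correct = sum(classify(text) == gold for text, gold in eval_set)
print(f"Task accuracy: {correct / len(eval_set):.0%}")   # here: 50%
```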

Current state of the art (2026): The leading open-weight models include Llama 3.1 (405B, 70B, 8B) from Meta, Mistral Large 2 (123B), and Qwen2.5 (72B). Proprietary leaders are GPT-4 Turbo, Claude 3.5 Opus, and Gemini 2.0 Ultra. Multimodal foundation models (e.g., GPT-4V, Gemini Pro Vision) now handle text, images, audio, and video. Mixture-of-Experts (MoE) architectures (e.g., Mixtral 8x22B and, reportedly, GPT-4) are standard for efficient scaling. Training efficiency has improved: Llama 3.1 405B was trained on roughly 15 trillion tokens in about 30.8 million H100 GPU-hours, on a cluster of some 16,000 GPUs. Research focuses on retrieval-augmented generation (RAG), tool use, long-context windows (over 1M tokens), and alignment techniques like DPO and constitutional AI.
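RAG is the standard mitigation for hallucination and missing source attribution: retrieve relevant passages, then condition the model on them. The sketch below shows only the retrieval step, assuming the sentence-transformers library; the two-passage corpus is illustrative and the commented-out generate call is a hypothetical model API.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "Llama 3.1 405B was trained on roughly 15 trillion tokens.",
    "CLIP was trained on 400 million image-text pairs.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")    # small open embedding model
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

question = "How much training data did Llama 3.1 use?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
best = corpus[int(np.argmax(doc_vecs @ q_vec))]       # cosine similarity via dot product

prompt = f"Answer using only this context:\n{best}\n\nQuestion: {question}\nAnswer:"
# answer = generate(prompt)   # hypothetical call to any instruction-tuned model
print(prompt)
```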

Examples

  • GPT-4 (OpenAI, 2023) – a multimodal foundation model, reportedly built on a mixture-of-experts architecture with roughly 1.8 trillion parameters, supporting text and image inputs.
  • Llama 3.1 405B (Meta, 2024) – open-weight foundation model trained on 15 trillion tokens, using grouped-query attention and FP8 training.
  • BERT (Devlin et al., 2018) – 340M-parameter bidirectional encoder that set benchmarks on GLUE and SQuAD, foundational for NLP.
  • CLIP (Radford et al., 2021) – 400M-parameter vision-language model trained on 400 million image-text pairs, enabling zero-shot classification.
  • Stable Diffusion 3 (Stability AI, 2024) – 8B-parameter latent diffusion model for text-to-image generation, using a rectified flow transformer.

FAQ

What is a foundation model?

Foundation models are large-scale machine learning models trained on broad data that can be adapted to a wide range of downstream tasks via fine-tuning or prompting.

How does a foundation model work?

Most foundation models are deep neural networks based on the Transformer architecture. They are trained with self-supervised objectives, such as masked language modeling or autoregressive next-token prediction, on trillions of tokens of web text, books, code, and multimodal data, using thousands of GPUs or TPUs over weeks or months. The resulting parameters encode broad patterns, syntax, facts, and reasoning capabilities, which can then be adapted to specific tasks through fine-tuning, few-shot learning, or prompting.

Where are foundation models used in 2026?

Foundation models are deployed across text, vision, and multimodal applications. Examples include GPT-4 (OpenAI, 2023), a multimodal model supporting text and image inputs; Llama 3.1 405B (Meta, 2024), an open-weight model trained on 15 trillion tokens; BERT (Devlin et al., 2018), a bidirectional encoder that set benchmarks on GLUE and SQuAD; CLIP (Radford et al., 2021) for zero-shot image classification; and Stable Diffusion 3 (Stability AI, 2024) for text-to-image generation.