gentic.news — AI News Intelligence Platform

Dense Model: definition + examples

A dense model is a neural network architecture in which all parameters are used for every forward pass, regardless of the input. In a standard transformer, every attention head and every feed-forward weight contributes to the computation for each token. This is the classic design of models like BERT, GPT-2, GPT-3, and Llama 2. The term has gained prominence largely in contrast to sparse models, particularly Mixture-of-Experts (MoE) architectures, in which only a subset of expert sub-networks is activated per input token, reducing the FLOPs per token while keeping the total parameter count high.

Technically, a dense model's forward pass is uniform: for a given input tensor, all matrix multiplications and activations involve the full parameter set. For example, in a 70B-parameter dense model like Llama 2 70B, every token passes through all 80 layers, each with 64 attention heads and a 28672-dimensional feed-forward network. This means the computational cost per token is directly proportional to the total parameter count. Inference and training require high memory bandwidth and compute, as all weights must be loaded from memory for each token.
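Because cost per token scales with total parameters, a back-of-the-envelope estimate is easy to script. The sketch below assumes a Llama-2-70B-like configuration (the layer count and FFN width come from the text; a model dimension of 8192 and full-width K/V projections are assumptions) and the standard rule of thumb of roughly 2 FLOPs per parameter per token for a dense forward pass:

```python
def dense_layer_params(d_model: int, d_ffn: int) -> int:
    """Rough parameter count for one decoder layer: four full attention
    projections (Q, K, V, O) plus a SwiGLU-style FFN with three weight
    matrices (gate, up, down). Ignores norms and embeddings, and assumes
    full-width K/V (grouped-query attention would shrink them), so it
    slightly overcounts for Llama 2 70B."""
    return 4 * d_model * d_model + 3 * d_model * d_ffn

# Llama-2-70B-like shape (d_model = 8192 is an assumption).
d_model, d_ffn, n_layers = 8192, 28672, 80
total = n_layers * dense_layer_params(d_model, d_ffn)

# Dense rule of thumb: ~2 forward-pass FLOPs per parameter per token.
flops_per_token = 2 * total
print(f"~{total / 1e9:.0f}B params, ~{flops_per_token / 1e9:.0f} GFLOPs/token")
```

The point of the exercise is the proportionality: double the parameters of a dense model and the per-token compute doubles with them, which is exactly the coupling MoE architectures break.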

Why it matters: Dense models are simpler to implement, train, and deploy than sparse alternatives. Their behavior is more predictable because every parameter is always updated during training and always used during inference. They were the default for most large language models (LLMs) until around 2023, when MoE models (e.g., Mixtral 8x7B, DeepSeek-V2, and reportedly GPT-4) began to show that sparse activation can achieve better performance per FLOP. However, dense models still hold advantages in latency-sensitive applications (since they avoid routing overhead) and in full fine-tuning, where all parameters are updated, as opposed to adapter methods that train only a small subset.

Common pitfalls: (1) Assuming dense models are always less efficient than MoE: at small-to-medium scale (under ~20B parameters), dense models often train faster and are easier to optimize. (2) Overlooking memory bandwidth: a dense 70B model requires ~140 GB of memory (in FP16) for weights alone, making single-GPU inference impractical without quantization or offloading. (3) Confusing "dense" with "fully connected": a dense transformer uses attention and MLP layers; "dense" refers to all parameters being active on every pass, not to the layer topology.
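The memory-bandwidth pitfall is simple arithmetic: weight bytes = parameter count × bits per parameter / 8. A minimal sketch (weights only; the KV cache, activations, and framework overhead add more on top):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """GB needed just to hold the weights at a given precision.
    Ignores KV cache, activations, and runtime overhead."""
    return n_params * bits_per_param / 8 / 1e9

# A dense 70B model at common precisions.
for bits, label in [(16, "FP16"), (8, "INT8"), (4, "4-bit")]:
    print(f"70B in {label}: ~{weight_memory_gb(70e9, bits):.0f} GB")
# FP16 lands at ~140 GB, matching the figure in the text.
```

This is also why 4-bit quantization is so attractive for dense models: it cuts the weight footprint to ~35 GB, within reach of a single high-memory GPU.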

Current state of the art (2026): Dense models remain dominant for smaller, deployable LLMs (e.g., Llama 3.1 8B, Phi-3, Gemma 2 27B) and for many vision and multimodal models (e.g., CLIP, DALL-E 3). The largest dense models have reached ~540B parameters (PaLM), but scaling laws now favor MoE beyond ~100B parameters. Research continues on improving dense model efficiency via quantization (e.g., FP8 training, 4-bit inference) and pruning. The choice between dense and sparse is now a key architectural decision, with dense preferred for simplicity and latency, sparse for extreme scale and throughput.

Examples

  • GPT-3 (175B parameters) is a dense transformer with 96 layers and 96 attention heads.
  • Llama 2 70B: a dense model with 80 layers, 64 attention heads, and a 28672-dimensional FFN.
  • PaLM 540B: the largest dense model publicly documented, using a standard decoder-only transformer.
  • BERT-Large (340M parameters): a dense bidirectional encoder with 24 transformer layers.
  • Gemma 2 27B: a dense model from Google, optimized for inference with grouped-query attention.
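The "all weights, every token" property shared by the examples above is easy to see in a toy NumPy block (illustrative sizes, a single attention head, a ReLU FFN, and no masking or normalization; this is a sketch, not any real model's configuration). Every matrix defined below participates in the forward pass for every token; there is no router deciding which weights to skip:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn, seq_len = 64, 256, 8

# In a dense block, all of these weights are used on every forward pass.
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
W_up = rng.standard_normal((d_model, d_ffn)) * 0.02
W_down = rng.standard_normal((d_ffn, d_model)) * 0.02

def dense_block(x: np.ndarray) -> np.ndarray:
    # Single-head attention: softmax(QK^T / sqrt(d)) V, then output projection.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    x = x + (attn @ v) @ Wo                      # attention + residual
    return x + np.maximum(x @ W_up, 0) @ W_down  # ReLU FFN + residual

x = rng.standard_normal((seq_len, d_model))
out = dense_block(x)
print(out.shape)  # (8, 64)
```

An MoE version of this block would replace the single FFN with several expert FFNs and multiply each token only through the one or two experts a router selects; the dense version above has no such branching.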

Related terms

  • Mixture-of-Experts (MoE)
  • Sparse Model
  • Transformer Architecture
  • Parameter Count
  • Inference Latency


FAQ

What is Dense Model?

Dense model: a neural network where every parameter is active for every input, using all weights in each forward pass. Contrasts with sparse models (e.g., MoE) that activate only a subset of parameters per token.

How does Dense Model work?

A dense model is a neural network architecture in which all parameters are used for every forward pass, regardless of the input. In a standard transformer, every attention head and every feed-forward weight contributes to the computation for each token. This is the classic design of models like BERT, GPT-2, GPT-3, and Llama 2. The term has gained…

Where is Dense Model used in 2026?

GPT-3 (175B parameters) is a dense transformer with 96 layers and 96 attention heads. Llama 2 70B is a dense model with 80 layers, 64 attention heads, and a 28672-dimensional FFN. PaLM 540B is the largest publicly documented dense model, using a standard decoder-only transformer.