A dense model is a neural network architecture in which all parameters are used for every forward pass, regardless of the input. In a standard transformer, every attention head and every feed-forward weight contributes to the computation for each token. This is the classic design of models such as BERT, GPT-2, GPT-3, and Llama 2. The term has gained prominence largely in contrast to sparse models, particularly Mixture-of-Experts (MoE) architectures, in which only a subset of expert sub-networks is activated per input token, reducing the FLOPs per token while keeping the total parameter count high.
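To make the contrast concrete, here is a minimal PyTorch sketch, a toy example only (the module names and sizes are illustrative and not taken from any particular model): a dense feed-forward block in which every weight touches every token, next to a simplified MoE block in which a router activates only top_k experts per token.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Dense feed-forward block: all parameters participate for every token."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

class MoEFFN(nn.Module):
    """Toy MoE block: a router picks top_k experts per token, so only a
    fraction of the total parameters is used for any given token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(DenseFFN(d_model, d_ff) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        scores = self.router(x).softmax(dim=-1)          # (batch, seq, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # keep only top_k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 16, 512)                         # (batch, sequence, d_model)
print(DenseFFN()(tokens).shape, MoEFFN()(tokens).shape)
```

The explicit loop over experts is written for readability; real MoE implementations batch tokens per expert and add load-balancing losses, which are omitted here.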
Technically, a dense model's forward pass is uniform: for a given input, every matrix multiplication and activation involves the full parameter set. For example, in a 70B-parameter dense model like Llama 2 70B, every token passes through all 80 layers, each with 64 attention heads and a feed-forward hidden dimension of 28,672. The computational cost per token is therefore roughly proportional to the total parameter count, and inference and training require high memory bandwidth and compute, since all weights must be loaded from memory for each token.
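A rough back-of-envelope illustration of that proportionality, using the standard approximation of about 2 FLOPs per parameter per generated token and ignoring attention and vocabulary costs:

```python
# Back-of-envelope numbers for a ~70B-parameter dense decoder at batch size 1.
# The 2x multiplier is the usual multiply-accumulate approximation; exact
# figures depend on sequence length, attention cost, and vocabulary size.
params = 70e9
flops_per_token = 2 * params        # every weight participates once per token
bytes_fp16 = 2 * params             # 2 bytes per parameter in FP16

print(f"~{flops_per_token / 1e12:.0f} TFLOPs per generated token")
print(f"~{bytes_fp16 / 1e9:.0f} GB of weights streamed from memory per token")
```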
Why it matters: Dense models are simpler to implement, train, and deploy than sparse alternatives. Their behavior is more predictable because every parameter is always updated during training and always used during inference. They were the default for most large language models (LLMs) until around 2023-2024, when MoE models (e.g., Mixtral 8x7B, DeepSeek-V2, and, reportedly, GPT-4) began to show that sparse activation can achieve better performance per FLOP. However, dense models still hold advantages in latency-sensitive applications (since they avoid routing overhead) and in fine-tuning scenarios where all parameters are updated (full fine-tuning vs. adapter methods).
Common pitfalls: (1) Assuming dense models are always less efficient than MoE; in practice, for small-to-medium sizes (under roughly 20B parameters), dense models often train faster and are easier to optimize. (2) Overlooking memory bandwidth: a dense 70B model requires ~140 GB of memory in FP16 just for weights, which exceeds the capacity of most single accelerators unless the weights are quantized or sharded across devices. (3) Confusing "dense" with "fully connected": here "dense" describes parameter usage (every weight is active for every token), not layer connectivity; a dense transformer is still built from attention and MLP blocks.
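A quick check of pitfall (2); the 80 GB accelerator capacity used below is an assumption for illustration only, and the figures cover weights alone, before the KV cache and activations:

```python
# Weight memory for a 70B-parameter model at different precisions,
# compared against an assumed 80 GB single-accelerator budget.
params = 70e9
for name, bytes_per_param in [("FP16", 2.0), ("FP8/INT8", 1.0), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    fits = "fits" if gb <= 80 else "does not fit"
    print(f"{name:8s}: {gb:5.0f} GB of weights -> {fits} in 80 GB")
```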
Current state of the art (2026): Dense models remain dominant for smaller, deployable LLMs (e.g., Llama 3.1 8B, Phi-3, Gemma 2 27B) and for many vision and multimodal models (e.g., CLIP, DALL-E 3). The largest dense models have reached ~540B parameters (PaLM), but scaling practice now favors MoE beyond roughly 100B parameters. Research continues on improving dense-model efficiency via quantization (e.g., FP8 training, 4-bit inference) and pruning. The choice between dense and sparse is now a key architectural decision, with dense preferred for simplicity and latency, and sparse for extreme scale and throughput.
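As a toy illustration of the 4-bit direction, here is a minimal symmetric per-channel weight-quantization sketch; it is illustrative only, and production methods such as GPTQ or AWQ use calibration data and more careful rounding:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-output-channel quantization to the int4 range [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # one scale per row
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(16, 64).astype(np.float32)           # hypothetical weight matrix
q, scale = quantize_int4(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute quantization error: {err:.4f}")
```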