A decoder-only model is a type of transformer architecture that generates sequences by predicting the next token given all previous tokens, using a causal (or autoregressive) attention mask. Unlike encoder-decoder models (e.g., the original Transformer for machine translation), decoder-only models have no separate encoder; they take a single input sequence and produce an output of arbitrary length by iteratively feeding each newly generated token back into the input. This design was popularized by the GPT series (Radford et al., 2018; Brown et al., 2020) and has become the backbone of virtually all major LLMs as of 2026, including GPT-4, Llama 3.1, Claude 3, Gemini, and Mistral.
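To make the generation loop concrete, here is a minimal Python/PyTorch sketch of greedy autoregressive decoding; the toy model, its sizes, and the prompt are stand-ins for illustration, not any real LLM or library API.

    # Minimal sketch of autoregressive (greedy) decoding: the model repeatedly
    # predicts the next token, which is appended to the input and fed back in.
    # The "model" here is a toy stand-in (random weights), not a trained LLM.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 100, 32            # illustrative sizes
    embed = nn.Embedding(vocab_size, d_model)
    lm_head = nn.Linear(d_model, vocab_size)

    def toy_model(token_ids: torch.Tensor) -> torch.Tensor:
        """Return next-token logits for every position: (batch, seq, vocab)."""
        return lm_head(embed(token_ids))

    prompt = torch.tensor([[1, 5, 7]])        # (batch=1, seq=3)
    tokens = prompt
    for _ in range(10):                       # generate 10 new tokens
        logits = toy_model(tokens)            # (1, seq, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        tokens = torch.cat([tokens, next_token], dim=1)             # feed back
    print(tokens)

At each step only the logits at the last position are used to choose the next token, which is then appended to the sequence for the following iteration.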
Technically, a decoder-only model stacks transformer decoder layers, each containing masked multi-head self-attention, feed-forward networks (FFNs), and residual connections with layer normalization. The causal mask ensures that each position can attend only to itself and earlier positions, preserving the autoregressive property. In practice, most modern decoder-only models (e.g., Llama 3.1 405B, Mixtral 8x22B) use grouped-query attention (GQA) or multi-query attention to shrink the KV cache and reduce memory bandwidth, and employ rotary positional embeddings (RoPE) or ALiBi for position encoding. Training is typically done via next-token prediction (causal language modeling) on large text corpora, followed by instruction tuning and reinforcement learning from human feedback (RLHF).
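The following single-head sketch shows how the causal mask is applied inside masked self-attention; multi-head projections, GQA, RoPE, and layer normalization are omitted, and the shapes are illustrative assumptions.

    # Sketch of the causal mask in self-attention: position i may attend only
    # to positions <= i. Single head, illustrative sizes.
    import torch
    import torch.nn.functional as F

    seq_len, d_head = 6, 16
    q = torch.randn(1, seq_len, d_head)   # queries
    k = torch.randn(1, seq_len, d_head)   # keys
    v = torch.randn(1, seq_len, d_head)   # values

    scores = q @ k.transpose(-2, -1) / d_head ** 0.5           # (1, seq, seq)
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~causal_mask, float('-inf'))   # block future positions
    attn = F.softmax(scores, dim=-1)                           # each row sums to 1 over the past
    out = attn @ v                                             # (1, seq, d_head)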
Why it matters: Decoder-only models have proven highly scalable and versatile. They can perform zero-shot and few-shot inference, generate coherent long-form text, and be fine-tuned for a wide range of tasks (chat, code generation, summarization, translation) without architectural changes. Their simple, unidirectional design makes them efficient for inference and amenable to techniques like speculative decoding, KV-cache optimization, and tensor parallelism. As of 2026, the largest decoder-only models exceed 1 trillion parameters (e.g., GPT-4 is rumored to be a mixture-of-experts variant of a decoder-only architecture), and they are deployed in products like ChatGPT, Claude, and Gemini.
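As one example of these inference optimizations, the sketch below shows the idea behind the KV cache: keys and values of already-processed tokens are stored, so each decoding step projects only the newest token. The single-head, single-layer setup and the weight names are illustrative assumptions, not a specific library's API.

    # Sketch of KV caching during decoding: keys/values of past tokens are kept,
    # so each new step computes projections only for the newest token.
    import torch

    d_model = 32
    Wq = torch.randn(d_model, d_model)
    Wk = torch.randn(d_model, d_model)
    Wv = torch.randn(d_model, d_model)

    k_cache, v_cache = [], []                # grows by one entry per generated token

    def decode_step(x_new: torch.Tensor) -> torch.Tensor:
        """x_new: (1, d_model) hidden state of the newest token only."""
        q = x_new @ Wq
        k_cache.append(x_new @ Wk)           # cache instead of recomputing history
        v_cache.append(x_new @ Wv)
        K = torch.cat(k_cache, dim=0)        # (t, d_model): all past + current keys
        V = torch.cat(v_cache, dim=0)
        attn = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
        return attn @ V                      # (1, d_model)

    for _ in range(5):                       # five decoding steps
        out = decode_step(torch.randn(1, d_model))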
When used vs. alternatives: Decoder-only models are preferred for open-ended text generation, conversational AI, and tasks where the output is a continuation of the input. Encoder-decoder models (e.g., T5, BART) are sometimes better for tasks requiring strong bidirectional context, such as abstractive summarization or translation, but even there, decoder-only models have closed the gap with techniques like prompt engineering and chain-of-thought prompting. Bidirectional encoder models (e.g., BERT) are still used for classification and retrieval, but for generative tasks decoder-only models are the default choice.
Common pitfalls: (1) Autoregressive generation is inherently sequential, leading to high latency for long outputs; techniques like speculative decoding mitigate this. (2) Decoder-only models can suffer from exposure bias: training conditions on ground-truth prefixes (teacher forcing), while inference conditions on the model's own previously generated tokens. (3) They require careful management of context windows; exceeding the trained context length (e.g., 128K tokens in Llama 3.1) degrades performance. (4) They are prone to hallucination and repetition, especially without proper sampling methods (e.g., top-k, top-p, temperature; see the sketch below). (5) Training is computationally expensive; the largest models require thousands of GPUs and weeks of training.
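For pitfall (4), here is a minimal sketch of temperature plus nucleus (top-p) sampling over a vector of next-token logits; the temperature and top_p values are illustrative defaults.

    # Sketch of temperature + nucleus (top-p) sampling from next-token logits,
    # one common way to reduce repetition relative to greedy decoding.
    import torch

    def sample_top_p(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
        probs = torch.softmax(logits / temperature, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # keep the smallest prefix of tokens whose cumulative probability reaches top_p
        cutoff = int((cumulative < top_p).sum().item()) + 1
        kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
        choice = torch.multinomial(kept_probs, num_samples=1)
        return int(sorted_ids[choice].item())

    next_id = sample_top_p(torch.randn(100))   # 100 = toy vocabulary size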
Current state of the art (2026): The leading open-source decoder-only model is Llama 3.1 405B, a dense model with 405 billion parameters and a 128K token context window, trained on over 15 trillion tokens. Mixture-of-experts (MoE) variants like Mixtral 8x22B and DeepSeek-V2 offer better efficiency per parameter. Proprietary models like GPT-4 and Claude 3 Opus are believed to be MoE decoder-only architectures with trillions of parameters. Research focuses on extending context windows (e.g., YaRN, NTK-aware scaling), improving long-context attention (e.g., Ring Attention, FlashAttention-3), and reducing inference cost via quantization (e.g., FP8, 4-bit).
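To give a flavor of the context-extension work mentioned above, the sketch below applies one commonly cited form of NTK-aware scaling, which enlarges the RoPE base rather than interpolating positions; the head dimension, scale factor, and exact formula here are illustrative assumptions rather than any specific model's recipe.

    # Sketch of NTK-aware RoPE base scaling: the rotary base is enlarged so that
    # low-frequency dimensions stretch to cover a longer context window.
    import torch

    def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
        """Standard RoPE inverse frequencies for the even dimensions."""
        return 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)

    def ntk_scaled_inv_freq(head_dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
        # One commonly cited variant: base' = base * scale ** (head_dim / (head_dim - 2))
        new_base = base * scale ** (head_dim / (head_dim - 2))
        return rope_inv_freq(head_dim, new_base)

    inv_freq = ntk_scaled_inv_freq(head_dim=128, scale=4.0)  # e.g. stretching a 32K window toward 128K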