gentic.news — AI News Intelligence Platform
Training & Inference

Speculative Decoding: definition + examples

Speculative decoding is a technique for accelerating the inference of large language models (LLMs) without altering their weights or output distribution. It exploits the observation that many tokens in a sequence are easy to predict, while only a few require the full capacity of a large model. The core idea is to use a small, fast 'draft' model to propose a short sequence of tokens, then have the large 'target' model verify those tokens in a single forward pass. If the target model accepts a token (which happens with a probability that depends on how closely the draft model's distribution agrees with its own), that token is kept and the process continues; if rejected, the target model resamples from its own distribution. This method guarantees exact equivalence to the target model's output distribution, making it a lossless acceleration technique.
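The draft-propose / target-verify loop above can be sketched in miniature. The following toy uses the greedy (argmax) variant of the idea; the function names and the integer stand-in "models" are illustrative, not from any library, and a real system would score all draft positions in one batched forward pass rather than one call per position:

```python
def greedy_speculative(draft_next, target_next, prompt, k, max_new):
    """Toy greedy speculative decoding: the draft proposes k tokens, the
    target keeps the longest matching prefix, then supplies one token itself.
    draft_next/target_next map a token list to the next (argmax) token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # Draft proposes k tokens autoregressively (cheap).
        proposals = []
        for _ in range(k):
            proposals.append(draft_next(out + proposals))
        # Target verifies: count the longest prefix of proposals it agrees with.
        n_ok = 0
        for i in range(k):
            if target_next(out + proposals[:i]) == proposals[i]:
                n_ok += 1
            else:
                break
        out += proposals[:n_ok]
        # The token at the first mismatch (or one bonus token) comes from the
        # target, so the output matches decoding with the target alone.
        out.append(target_next(out))
    return out[len(prompt):][:max_new]

# Toy stand-in models over a 5-token vocabulary: the target cycles 0,1,2,3,4,...
target_next = lambda prefix: (prefix[-1] + 1) % 5
# The draft agrees except at every third position, where it guesses wrong.
draft_next = lambda prefix: (prefix[-1] + (2 if len(prefix) % 3 == 0 else 1)) % 5
```

Because every kept token is one the target itself would have chosen, the output is identical to decoding with the target alone; only the number of (expensive) target calls changes.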

How it works technically: The draft model (e.g., a tiny 100M-parameter model) autoregressively generates a block of K candidate tokens. The target model (e.g., a 70B-parameter model) then processes this block in parallel using a modified sampling scheme. For each position in the block, the target model computes the probability p of the draft token, which the draft model assigned probability q. If p is at least as high as q, the token is accepted; if not, it is accepted with probability p/q and rejected otherwise. Upon rejection, the target model resamples a token from the normalized residual distribution max(0, p − q), which corrects for the rejection, and the process restarts from that point; if all K draft tokens are accepted, the target model samples one extra token from its final position, so each target pass yields between 1 and K+1 tokens. The per-token acceptance rate, and hence the expected number of tokens kept per block, depends on how well the draft model approximates the target model. Typical speedups range from 2x to 5x in wall-clock time, depending on the models and hardware.
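A minimal sketch of one verify-and-resample step, with fixed toy distributions standing in for the two models. The name `speculative_step` and the toy 3-token vocabulary are illustrative assumptions; in a real implementation the target scores all K positions in a single batched forward pass:

```python
import random

def speculative_step(draft_probs, target_probs, k, rng):
    """One speculative block: the draft proposes k tokens; each is accepted
    with probability min(1, p/q), where p/q are target/draft probabilities.
    On rejection, resample from the normalized residual max(0, p - q).
    Returns between 1 and k+1 tokens, distributed exactly as the target."""
    proposals = []
    for _ in range(k):  # draft generates its block autoregressively
        q = draft_probs(proposals)
        proposals.append(rng.choices(list(q), weights=list(q.values()))[0])
    accepted = []
    for tok in proposals:
        q = draft_probs(accepted)
        p = target_probs(accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # A rejection implies p != q, so the residual has positive mass.
            residual = {t: max(0.0, p[t] - q[t]) for t in p}
            accepted.append(rng.choices(list(residual),
                                        weights=list(residual.values()))[0])
            return accepted
    # All k accepted: the target's extra position yields one bonus token.
    p = target_probs(accepted)
    accepted.append(rng.choices(list(p), weights=list(p.values()))[0])
    return accepted

# Toy context-independent "models" over a 3-token vocabulary.
draft = lambda prefix: {0: 0.5, 1: 0.3, 2: 0.2}
target = lambda prefix: {0: 0.6, 1: 0.2, 2: 0.2}
```

Averaged over many runs, the emitted tokens follow the target distribution exactly: token 0 comes out first about 60% of the time even though the draft proposes it only 50% of the time, which is the lossless-acceleration guarantee in miniature.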

Why it matters: Speculative decoding addresses a critical bottleneck in LLM deployment: the high latency and cost of autoregressive generation, where each token requires a full forward pass through a massive model. By leveraging the parallelism of modern GPUs (which can process many tokens at once), speculative decoding achieves significant speedups without any training or fine-tuning of the large model. This is especially valuable for real-time applications like chatbots, code assistants, and interactive agents, where latency directly impacts user experience. It also reduces the cost per query, making large models more economical to serve.

When it's used vs alternatives: Speculative decoding is an inference-time optimization, distinct from training-time methods like model distillation, pruning, or quantization. It is most effective when the draft model is small but reasonably accurate (e.g., a distilled version of the target model) and when decoding is memory-bandwidth-bound, leaving spare parallel compute for verifying several tokens at once. Alternatives include: (a) quantization (e.g., FP8, INT4), which reduces model size but may degrade quality; (b) pruning or sparse attention, which require architectural changes; (c) early-exit strategies, which modify the model's internal structure. Speculative decoding is complementary to these methods and can be combined with them for further gains.

Common pitfalls: (1) Poor draft model quality leads to low acceptance rates and minimal speedup. The draft model must be fast enough to offset its own generation cost. (2) Overly long draft blocks waste compute, because every token after the first rejection is discarded; optimal block length is typically 4–8 tokens. (3) Implementation complexity: managing two models and their KV caches, especially when using different hardware placements (e.g., draft on CPU, target on GPU). (4) Not all models benefit equally; autoregressive models with high 'draftability' (e.g., code generation) see larger gains than models with very uniform token probabilities.
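Pitfalls (1) and (2) can be made quantitative. Under the simplifying assumption from the Leviathan et al. analysis that each draft token is accepted independently with probability α, the expected number of tokens per target pass with block length K is (1 − α^(K+1)) / (1 − α). The sketch below adds an assumed draft-cost ratio c (one draft step costs c target-steps) to estimate the speedup; both function names are illustrative:

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass with block length k,
    assuming i.i.d. per-token acceptance probability alpha."""
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Rough wall-clock speedup vs plain decoding: tokens gained per block,
    divided by the block's cost (k draft steps plus 1 target pass)."""
    return expected_tokens(alpha, k) / (k * c + 1.0)

for k in (2, 4, 8, 16):
    print(k, round(estimated_speedup(0.8, k, 0.05), 2))
```

With α = 0.8 and a draft costing 5% of a target step, the estimate rises to a peak around K = 8 and then declines, illustrating pitfall (2): past a point, longer blocks mostly generate tokens that get thrown away.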

Current state of the art (2026): Speculative decoding is widely adopted in production systems. Google's TPU-based systems use it for Gemini 2.0, achieving 3x speedups. OpenAI's GPT-4o reportedly uses a variant called 'self-speculative decoding' where the target model itself is used as a draft via early exit layers. Meta's Llama 3.1 405B deployment uses a 7B draft model for 2.5x latency reduction. Research focuses on adaptive draft model selection, multi-draft ensembles, and hardware-aware scheduling. A notable 2025 paper from MIT introduced 'speculative beam search', which extends the idea to beam search decoding for improved quality in translation tasks.

Examples

  • Google Gemini 2.0 uses speculative decoding with a small draft model to achieve ~3x inference speedup on TPU v5p clusters.
  • Meta's Llama 3.1 405B is served with a 7B-parameter draft model (distilled from the 405B) yielding 2.5x lower latency in production.
  • OpenAI's GPT-4o employs self-speculative decoding where early layers of the same model act as the draft, avoiding the need for a separate model.
  • Hugging Face's Text Generation Inference (TGI) library added native speculative decoding support in v2.0, enabling users to pair any Hugging Face model with a smaller draft.
  • The 2023 paper 'Fast Inference from Transformers via Speculative Decoding' (Leviathan et al., ICML 2023) demonstrated 2x–3x speedups on T5-XXL using much smaller T5 models as drafts.

Related terms

KV Cache · Model Distillation · Quantization · Parallel Decoding · Inference Optimization
FAQ

What is Speculative Decoding?

Speculative decoding is an inference-time technique that uses a small draft model to generate candidate tokens, which are then verified in parallel by a large target model, achieving speedups without modifying the target model's weights.

How does Speculative Decoding work?

Speculative decoding accelerates LLM inference without altering the target model's weights or output distribution: a small, fast 'draft' model proposes a short sequence of tokens, and the large 'target' model verifies them in a single forward pass, keeping tokens it agrees with and resampling from its own distribution at the first rejection. Because the rejection rule exactly corrects for the draft's errors, the method is lossless.

Where is Speculative Decoding used in 2026?

Google Gemini 2.0 uses speculative decoding with a small draft model to achieve ~3x inference speedup on TPU v5p clusters. Meta's Llama 3.1 405B is served with a 7B-parameter draft model (distilled from the 405B) yielding 2.5x lower latency in production. OpenAI's GPT-4o employs self-speculative decoding where early layers of the same model act as the draft, avoiding the need for a separate model.