Speculative decoding is a technique for accelerating the inference of large language models (LLMs) without altering their weights or output distribution. It exploits the observation that many tokens in a sequence are easy to predict, while only a few require the full capacity of a large model. The core idea is to use a small, fast 'draft' model to propose a short sequence of tokens, then have the large 'target' model verify those tokens in a single forward pass. Each draft token is accepted or rejected by comparing the probabilities the two models assign to it: accepted tokens are kept, and on the first rejection the target model resamples a replacement from a corrected distribution and drafting resumes from there. This procedure is guaranteed to reproduce exactly the target model's output distribution, making it a lossless acceleration technique.
How it works technically: The draft model (e.g., a small ~100M-parameter model) autoregressively generates a block of K candidate tokens. The target model (e.g., a 70B-parameter model) then scores the whole block in a single parallel forward pass and applies a modified rejection-sampling scheme. For each position, let p be the target model's probability of the draft token and q the draft model's. If p is at least as high as q, the token is accepted; if p < q, it is accepted only with probability p/q. Upon rejection, the target model resamples from the residual distribution obtained by normalizing max(0, p_target − p_draft) over the vocabulary, which exactly corrects for the rejection, and drafting restarts from that point. If all K draft tokens are accepted, the same target pass supplies a free (K+1)-th token sampled from the target distribution. The expected number of tokens produced per target pass depends on the per-token acceptance rate, i.e., how well the draft model approximates the target model. Typical speedups are 2x to 3x in wall-clock time, with some recent variants reporting more, depending on the models and hardware.
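The accept/reject rule is easiest to see in code. Below is a minimal, self-contained sketch of one speculative step over a toy vocabulary; `draft_dist` and `target_dist` are illustrative stand-ins (functions from a context to a next-token distribution), not any real model API, and a production system would batch the verify phase into one forward pass rather than calling the target per position.

```python
import numpy as np

def speculative_step(context, draft_dist, target_dist, K=4, rng=None):
    """One speculative decoding step: draft K tokens, verify, return new context."""
    rng = rng or np.random.default_rng()

    # Draft phase: the small model proposes K tokens autoregressively.
    drafted, q_dists = [], []
    ctx = list(context)
    for _ in range(K):
        q = draft_dist(ctx)
        tok = int(rng.choice(len(q), p=q))
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # Verify phase: a real system scores all K positions in ONE parallel
    # target forward pass; for clarity we call target_dist per position.
    accepted = []
    for tok, q in zip(drafted, q_dists):
        p = target_dist(list(context) + accepted)
        # Accept with probability min(1, p[tok] / q[tok]).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # Rejection: resample from the residual distribution
            # norm(max(0, p - q)); this correction makes the overall output
            # distribution exactly match the target model's.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return list(context) + accepted  # drafting restarts from here

    # All K drafted tokens accepted: the same target pass yields a free
    # (K+1)-th token sampled from the target distribution.
    p = target_dist(list(context) + accepted)
    accepted.append(int(rng.choice(len(p), p=p)))
    return list(context) + accepted

# Toy usage: fixed 4-token distributions standing in for real models.
draft_dist = lambda ctx: np.array([0.4, 0.3, 0.2, 0.1])
target_dist = lambda ctx: np.array([0.25, 0.25, 0.25, 0.25])
print(speculative_step([], draft_dist, target_dist, K=4))
```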
Why it matters: Speculative decoding addresses a critical bottleneck in LLM deployment: the high latency and cost of autoregressive generation, where each token requires a full forward pass through a massive model. Autoregressive decoding is typically memory-bandwidth-bound, so a GPU can verify K tokens in one forward pass for little more than the cost of generating one; speculative decoding converts that otherwise idle compute into speedup without any training or fine-tuning of the large model. This is especially valuable for real-time applications like chatbots, code assistants, and interactive agents, where latency directly impacts user experience. It also reduces the cost per query, making large models more economical to serve.
When it's used vs alternatives: Speculative decoding is an inference-time optimization, distinct from training-time methods like model distillation, pruning, or quantization. It is most effective when the draft model is small but reasonably accurate (e.g., a distilled version of the target model) and when serving is memory-bandwidth-bound (e.g., small batch sizes), leaving spare GPU compute for parallel verification. Alternatives include: (a) quantization (e.g., FP8, INT4), which shrinks memory and bandwidth requirements but may degrade quality; (b) pruning or sparse attention, which require architectural changes; (c) early-exit strategies, which modify the model's internal structure. Speculative decoding is complementary to these methods and can be combined with them for further gains; a configuration sketch follows below.
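For concreteness, here is how a draft/target pair is wired up in one popular serving stack. This follows the `speculative_model` / `num_speculative_tokens` arguments documented in vLLM's 0.x-series API; newer releases have moved to a `speculative_config` dict, so treat this as a sketch to check against current docs, and the model names are merely illustrative.

```python
from vllm import LLM, SamplingParams

# Target model with a small same-family draft model; vLLM handles the
# draft-propose / target-verify loop and both KV caches internally.
llm = LLM(
    model="facebook/opt-6.7b",          # target model (illustrative)
    speculative_model="facebook/opt-125m",  # draft model (illustrative)
    num_speculative_tokens=5,           # block length K
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)
print(outputs[0].outputs[0].text)
```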
Common pitfalls: (1) Poor draft model quality leads to low acceptance rates and minimal speedup; the draft model must also be fast enough that its own generation cost does not eat the gains. (2) Overly long draft blocks waste work: every token after the first rejection is discarded, so expected gains saturate while drafting cost grows linearly with block length; optimal block length is typically 4–8 tokens, as the snippet below quantifies. (3) Implementation complexity: managing two models and their KV caches, especially with split hardware placements (e.g., draft on CPU, target on GPU), and rolling the caches back after rejections. (4) Not all workloads benefit equally; tasks with high 'draftability' (e.g., code generation, with its predictable boilerplate) see larger gains than text with flatter, more uniform token probabilities.
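The tradeoff in pitfall (2) follows from a simple closed form. Under the simplifying assumption of a constant, i.i.d. per-token acceptance rate alpha (as in Leviathan et al., 2023), the expected number of tokens per target pass with block length K is (1 − alpha^(K+1)) / (1 − alpha); a few lines of Python make the saturation visible (alpha here is an assumed, idealized constant):

```python
# Expected tokens generated per target forward pass, assuming an i.i.d.
# per-token acceptance rate `alpha`:
#   E[tokens] = (1 - alpha**(K + 1)) / (1 - alpha)
def expected_tokens(alpha: float, K: int) -> float:
    return (1 - alpha ** (K + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, {K: round(expected_tokens(alpha, K), 2) for K in (2, 4, 8, 16)})

# At alpha = 0.6, going from K=4 to K=16 raises the expectation only from
# ~2.31 to ~2.50 tokens, while drafting cost grows linearly in K -- which is
# why short blocks (4-8 tokens) are typical in practice.
```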
Current state of the art (2026): Speculative decoding is widely adopted in production systems. Google's TPU-based systems use it for Gemini 2.0, achieving 3x speedups. OpenAI's GPT-4o reportedly uses a variant called 'self-speculative decoding' where the target model itself is used as a draft via early exit layers. Meta's Llama 3.1 405B deployment uses a 7B draft model for 2.5x latency reduction. Research focuses on adaptive draft model selection, multi-draft ensembles, and hardware-aware scheduling. A notable 2025 paper from MIT introduced 'speculative beam search', which extends the idea to beam search decoding for improved quality in translation tasks.