Google claims Gemma 4 achieves up to 3x faster inference using novel MTP drafters. The claim, posted on X by @googledevs, promises the same quality with dramatically reduced latency.
Key facts
- Google claims up to 3x faster inference for Gemma 4.
- MTP drafters predict multiple tokens per step.
- No benchmark numbers or architectural details released.
- Prior Gemma 3 launched with 2B, 7B, 27B variants in March 2026.
Google announced Gemma 4 with a speedup claim of up to 3x over prior versions, enabled by new MTP (Multi-Token Prediction) drafters. According to the tweet from @googledevs, these drafters predict multiple tokens per forward pass, a departure from standard autoregressive generation that predicts one token at a time.
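The difference between one-token-per-step autoregressive decoding and multi-token prediction can be illustrated with a minimal sketch. Gemma 4's actual mechanism is undisclosed, so everything below is a toy: the "model" is a deterministic stand-in function, and the function names are hypothetical.

```python
# Toy contrast: standard autoregressive decoding vs. multi-token prediction.
# A real MTP model emits k tokens from one forward pass via extra prediction
# heads; here one simulated "pass" simply yields k tokens from a toy rule.

VOCAB = "abcdefgh"

def next_token(context):
    # Stand-in for a model forward pass: a deterministic toy rule.
    return VOCAB[(len(context) * 3) % len(VOCAB)]

def autoregressive_decode(context, n):
    # One forward pass per generated token: n passes for n tokens.
    passes = 0
    for _ in range(n):
        context += next_token(context)
        passes += 1
    return context, passes

def mtp_decode(context, n, k=4):
    # k tokens per simulated forward pass: roughly n/k passes for n tokens.
    passes = 0
    while n > 0:
        step = min(k, n)
        for _ in range(step):  # all k tokens come from ONE simulated pass
            context += next_token(context)
        passes += 1
        n -= step
    return context, passes
```

Generating 12 tokens takes 12 passes autoregressively but only 3 passes with k=4 heads, which is the arithmetic behind speedup claims of this kind; the open question is whether quality holds when tokens are predicted in parallel.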
The unique take here is that MTP drafters represent a practical application of speculative decoding techniques, which have been explored in research (e.g., Leviathan et al. 2023) but rarely deployed as a core feature of a production model family. Speculative decoding typically uses a small draft model to propose tokens and a target model to verify them; Gemma 4's MTP drafters appear to integrate this into the model itself, potentially reducing the memory and latency overhead of running two separate models.
Google did not disclose specific benchmark numbers, model sizes, or the hardware configurations behind the speedup claim. The tweet offers no architectural details, no training-compute figures, and no comparisons against prior Gemma versions or competitors such as Llama 4 or Mistral. The claim of "same quality" is also unsubstantiated — no perplexity, MMLU, or HumanEval scores were provided.
This announcement aligns with Google's pattern of incremental model releases. Gemma 3 launched in March 2026 with 2B, 7B, and 27B variants; Gemma 4's speed improvements could be critical for on-device and edge deployments where latency is a bottleneck.
What to Watch
Watch for Google to release technical documentation or a paper detailing MTP drafters. The key metric to track is whether the 3x speedup holds on standard inference hardware (e.g., A100, H100, TPU v5) and whether quality metrics like MMLU or GSM8K remain within 1% of the baseline. Also monitor for open-source implementations of MTP drafters from the community.