ByteDance and Renmin University released iLLaDA, an 8B diffusion language model that matches Qwen2.5 7B on base benchmarks but trails by 10 points after instruction tuning. The model, trained from scratch on 12 trillion tokens, represents a bet that diffusion can compete with autoregressive generation on quality, not just speed.
Key facts
- iLLaDA trained on 12 trillion tokens, up from 2.3T for LLaDA.
- iLLaDA-Base averages 63.9 vs Qwen2.5 7B's 63.3.
- iLLaDA-Instruct scores 67.1 vs Qwen2.5 7B Instruct's 77.1.
- BBH reasoning score jumped 21.6 points over LLaDA.
- Google's DiffusionGemma trades quality for roughly 4x faster generation.
The Diffusion Alternative to Autoregressive Generation
Nearly all commercial LLMs, including GPT, Claude, and Qwen, generate text autoregressively: left to right, one token at a time. Diffusion language models like iLLaDA start with a sequence of masked tokens and refine them in parallel across multiple passes, similar to how image diffusion models denoise from random noise. Because these models use bidirectional attention instead of the left-to-right causal mask of autoregressive models, every token position can attend to every other position on each pass.
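To make the multi-pass refinement concrete, here is a minimal Python sketch of masked-diffusion decoding. It is an illustration under stated assumptions, not iLLaDA's actual sampler: the source does not describe the unmasking schedule, so the sketch uses confidence-based unmasking, a common scheme for masked diffusion decoders. The names `toy_predict` and `diffusion_decode` are hypothetical, and the "model" simply looks up a fixed target sentence with random confidences in place of a real denoising network.

```python
import random

MASK = "<mask>"

def toy_predict(seq, target, rng):
    """Hypothetical stand-in for a denoising network.

    Returns {position: (token, confidence)} for every masked position.
    A real model would compute these with bidirectional attention over
    the whole sequence; here we look up a fixed target sentence and
    draw a random confidence for each position.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(target, steps=4, seed=0):
    """Start fully masked and commit the most confident tokens each pass."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    history = [list(seq)]
    for step in range(steps):
        preds = toy_predict(seq, target, rng)  # all masked positions at once
        if not preds:
            break
        remaining_steps = steps - step
        # Commit an even share of the still-masked positions this pass
        # (ceiling division, so the final pass always clears every mask).
        k = -(-len(preds) // remaining_steps)
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history

if __name__ == "__main__":
    sentence = "diffusion models fill in masked tokens in parallel".split()
    _, passes = diffusion_decode(sentence, steps=4)
    for n, state in enumerate(passes):
        print(n, " ".join(state))
```

Each printed row is one pass. Tokens appear in order of confidence rather than left to right, which is the key contrast with autoregressive decoding, and the number of passes (`steps`) is fixed independently of the sequence length.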
According to The Decoder, iLLaDA is part of a broader movement that includes Google's DiffusionGemma, released in June 2026. DiffusionGemma, built on the 25B-parameter Gemma 4 MoE backbone, generates text about four times faster via diffusion but scores worse on MMLU and code benchmarks. Google recommends it for low-latency use cases, not quality-critical production. iLLaDA takes the opposite approach: a dense 8B model trained from scratch, prioritizing quality over speed.
Benchmark Results: Base Parity, Instruct Gap
The iLLaDA team pretrained the model on 12 trillion tokens—up from 2.3 trillion for its predecessor LLaDA—and fine-tuned for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH. On average it hits 63.9 points, edging past the autoregressive Qwen2.5 7B at 63.3.

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream only holds a slight edge on coding benchmarks.
A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.
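The margins quoted in this section can be reproduced from the reported averages alone. The short Python sketch below only restates the figures given above; the `scores` dictionary and the `margin` helper are illustrative names, not anything from the paper.

```python
# Average benchmark scores as reported in this article.
scores = {
    "iLLaDA-Base": 63.9,
    "Qwen2.5 7B": 63.3,
    "Dream 7B": 61.4,
    "iLLaDA-Instruct": 67.1,
    "Qwen2.5 7B Instruct": 77.1,
}

def margin(model, baseline):
    """Point difference between two averages, rounded to one decimal."""
    return round(scores[model] - scores[baseline], 1)

if __name__ == "__main__":
    print("Base vs Qwen2.5 7B:", margin("iLLaDA-Base", "Qwen2.5 7B"))
    print("Base vs Dream 7B:", margin("iLLaDA-Base", "Dream 7B"))
    print("Instruct vs Qwen2.5 7B Instruct:",
          margin("iLLaDA-Instruct", "Qwen2.5 7B Instruct"))
```

The base model leads Qwen2.5 7B by 0.6 points and Dream 7B by 2.5 points, while the instruct model trails Qwen2.5 7B Instruct by 10.0 points.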
What This Means for the Diffusion LLM Race
ByteDance's iLLaDA demonstrates that a diffusion model trained from scratch can match autoregressive models at the base level, a non-trivial result given that prior diffusion LMs like Dream relied on autoregressive checkpoints for initialization. The 10-point instruct gap, however, highlights the importance of RL-based alignment, which diffusion models have not yet mastered. Google's DiffusionGemma, at a larger 25B parameter count, makes the opposite trade and gives up quality for speed, suggesting that diffusion LMs are currently best suited for latency-sensitive applications rather than quality-critical production.
ByteDance has been investing heavily in AI infrastructure. As previously reported, the company purchased tens of thousands of Iluvatar CoreX AI processors for cloud infrastructure in June 2026, signaling its intent to scale AI workloads domestically despite US export controls.
What to watch
Watch for ByteDance to release an iLLaDA variant with RL-based alignment, which could close the instruct gap. Also track whether Google scales DiffusionGemma beyond low-latency niches—if diffusion LMs match autoregressive quality within 12 months, the LLM architecture landscape shifts.
Source: the-decoder.com








