gentic.news — AI News Intelligence Platform

AI Research · Breakthrough · Score: 89

ByteDance iLLaDA: 8B Diffusion LM Matches Qwen2.5 Base, Lags on Instruct

ByteDance iLLaDA, an 8B diffusion LM trained on 12T tokens, matches Qwen2.5 7B on base benchmarks (63.9 vs 63.3) but trails by 10 points after instruction tuning (67.1 vs 77.1), revealing an alignment gap for diffusion models.

1d ago · 3 min read · AI-Generated
Source: the-decoder.com via the_decoder · Multi-Source
How does ByteDance's iLLaDA diffusion language model compare to Qwen2.5?

ByteDance and Renmin University's iLLaDA, an 8B diffusion language model trained on 12 trillion tokens, matches Qwen2.5 7B on base benchmarks (63.9 vs 63.3) but trails after fine-tuning (67.1 vs 77.1).

TL;DR

ByteDance iLLaDA is an 8B diffusion language model. · It matches Qwen2.5 7B at the base level but lags by 10 points after instruction tuning. · Trained on 12 trillion tokens, it beats Dream 7B on average.

ByteDance and Renmin University released iLLaDA, an 8B diffusion language model that matches Qwen2.5 7B on base benchmarks but trails by 10 points after instruction tuning. The model, trained from scratch on 12 trillion tokens, represents a bet that diffusion can compete with autoregressive generation on quality, not just speed.

Key facts

  • iLLaDA trained on 12 trillion tokens, up from 2.3T for LLaDA.
  • iLLaDA-Base averages 63.9 vs Qwen2.5 7B's 63.3.
  • iLLaDA-Instruct scores 67.1 vs Qwen2.5 7B Instruct's 77.1.
  • BBH reasoning score jumped 21.6 points over LLaDA.
  • Google's DiffusionGemma trades quality for 4x speed.

The Diffusion Alternative to Autoregressive Generation

Nearly all commercial LLMs, including GPT, Claude, and Qwen, generate text autoregressively: left to right, one token at a time. Diffusion language models like iLLaDA instead start with a sequence of masked tokens and refine them in parallel across multiple passes, similar to how image diffusion models denoise from random noise. Because the model uses bidirectional attention, every token position can attend to every other position during each pass.
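The refinement loop described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `toy_denoiser` is a hypothetical stand-in for a trained network (it simply reads the answer from `target`), and the confidence-based unmasking schedule is one common choice for masked diffusion decoding, not a confirmed description of iLLaDA's sampler.

```python
MASK = "[MASK]"

def toy_denoiser(seq, target):
    """Hypothetical stand-in for a trained model.

    For every masked position it proposes a token and a confidence score.
    A real model would predict tokens from context; this toy cheats by
    reading them from `target`.
    """
    proposals = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        # Look both left and right (bidirectional context): confidence is
        # higher when neighbouring positions are already revealed.
        neighbours = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        revealed = sum(n != MASK for n in neighbours)
        proposals[i] = (target[i], 0.5 + 0.2 * revealed - 0.01 * i)
    return proposals

def diffusion_decode(target, steps=3):
    """Start fully masked and reveal several tokens per pass."""
    seq = [MASK] * len(target)
    history = [list(seq)]
    for step in range(steps):
        proposals = toy_denoiser(seq, target)
        if not proposals:
            break
        remaining_steps = steps - step
        # Reveal an equal share of the still-masked positions each pass
        # (ceiling division so everything is revealed by the last pass).
        k = -(-len(proposals) // remaining_steps)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = proposals[i][0]
        history.append(list(seq))
    return seq, history

tokens = "the cat sat on the mat".split()
final, history = diffusion_decode(tokens, steps=3)
for snapshot in history:
    print(" ".join(snapshot))
```

With six tokens and three passes, each pass fills in two positions at once, whereas an autoregressive decoder would need six sequential steps for the same sequence.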

According to The Decoder, iLLaDA is part of a broader movement that includes Google's DiffusionGemma, released in June 2026. DiffusionGemma, built on the 25B-parameter Gemma 4 MoE backbone, generates text about four times faster via diffusion but scores worse on MMLU and code benchmarks. Google recommends it for low-latency use cases, not quality-critical production. iLLaDA takes the opposite approach: a dense 8B model trained from scratch, prioritizing quality over speed.

Benchmark Results: Base Parity, Instruct Gap

The iLLaDA team pretrained the model on 12 trillion tokens, up from 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, gaining 21.6 points on the BBH reasoning benchmark. Its average score is 63.9 points, edging past the autoregressive Qwen2.5 7B at 63.3.


The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream only holds a slight edge on coding benchmarks.

A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.
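The gaps quoted in this section follow directly from the reported averages. The short calculation below reproduces them; the numbers are the ones cited in this article, while the dictionary layout and the helper name `gap` are illustrative assumptions.

```python
# Average benchmark scores as reported in this article.
scores = {
    "base": {"iLLaDA": 63.9, "Qwen2.5 7B": 63.3, "Dream 7B": 61.4},
    "instruct": {"iLLaDA": 67.1, "Qwen2.5 7B": 77.1},
}

def gap(stage, model, baseline):
    """Score difference (model minus baseline), rounded to one decimal."""
    return round(scores[stage][model] - scores[stage][baseline], 1)

base_vs_qwen = gap("base", "iLLaDA", "Qwen2.5 7B")          # positive: iLLaDA ahead
base_vs_dream = gap("base", "iLLaDA", "Dream 7B")           # positive: iLLaDA ahead
instruct_vs_qwen = gap("instruct", "iLLaDA", "Qwen2.5 7B")  # negative: iLLaDA behind

# Pretraining data scale-up from LLaDA (2.3T tokens) to iLLaDA (12T tokens).
token_ratio = round(12 / 2.3, 1)

print(base_vs_qwen, base_vs_dream, instruct_vs_qwen, token_ratio)
```

The base-level lead over Qwen2.5 7B is well under one point, while the instruct-level deficit is an order of magnitude larger, which is why the article describes parity at the base level and a gap after instruction tuning.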

What This Means for the Diffusion LLM Race

ByteDance's iLLaDA demonstrates that a diffusion model trained from scratch can match autoregressive models at the base level. That is a non-trivial result, given that prior diffusion LMs like Dream relied on autoregressive checkpoints for initialization. The 10-point instruct gap, however, highlights the importance of RL-based alignment, which iLLaDA lacks. Google's DiffusionGemma, at a larger 25B parameter count, similarly trades quality for speed, suggesting that diffusion LMs are currently best suited for latency-sensitive applications rather than quality-critical production.

ByteDance has been investing heavily in AI infrastructure. As previously reported, the company purchased tens of thousands of Iluvatar CoreX AI processors for cloud infrastructure in June 2026, signaling its intent to scale AI workloads domestically despite US export controls.

What to watch

Watch for ByteDance to release an iLLaDA variant with RL-based alignment, which could close the instruct gap. Also track whether Google scales DiffusionGemma beyond low-latency niches—if diffusion LMs match autoregressive quality within 12 months, the LLM architecture landscape shifts.



AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

iLLaDA's base-level parity with Qwen2.5 is significant because it shows diffusion LMs can match autoregressive models when trained from scratch, not just fine-tuned from existing checkpoints. The 10-point instruct gap, however, underscores a structural limitation: diffusion models lack the RL-based alignment that has become standard for autoregressive models. This mirrors Google's experience with DiffusionGemma, which trades quality for speed. The key question is whether diffusion LMs can close this gap with better alignment techniques, or whether the bidirectional generation process inherently limits instruction-following. ByteDance's investment in domestic AI chips suggests they are betting on diffusion as a long-term architecture, but the instruct gap may limit near-term adoption to latency-sensitive use cases like real-time translation or streaming text. Compared to the autoregressive arms race, diffusion LMs represent a contrarian bet: that parallel generation can eventually match or exceed sequential generation with sufficient scale and alignment. iLLaDA provides the best evidence yet that this bet has merit, but the instruct gap remains a critical hurdle.