How does iLLaDA generate text differently from ChatGPT?

iLLaDA uses diffusion: it starts with masked tokens and refines them in parallel across multiple passes, unlike ChatGPT's left-to-right autoregressive generation.

Why does iLLaDA lag behind Qwen2.5 on instruct tasks?

The authors attribute the gap to Qwen2.5's extra reinforcement learning alignment, which iLLaDA lacks, particularly affecting math and code performance.

ByteDance iLLaDA: 8B Diffusion LM Matches Qwen2.5 Base, Lags on Instruct

ByteDance iLLaDA, an 8B diffusion LM trained on 12T tokens, matches Qwen2.5 7B on base benchmarks (63.9 vs 63.3) but trails by 10 points after instruction tuning (67.1 vs 77.1), revealing an alignment gap for diffusion models.

Ala SMITH & AI Research Desk · 1d ago · 3 min read · AI-Generated

Source: the-decoder.com (via the_decoder)

How does ByteDance's iLLaDA diffusion language model compare to Qwen2.5?

ByteDance and Renmin University's iLLaDA, an 8B diffusion language model trained on 12 trillion tokens, matches Qwen2.5 7B on base benchmarks (63.9 vs 63.3) but trails after fine-tuning (67.1 vs 77.1).

TL;DR

ByteDance iLLaDA is an 8B diffusion language model. · Matches Qwen2.5 7B at the base level but lags by 10 points after instruction tuning. · Trained on 12 trillion tokens, beats Dream 7B.

ByteDance and Renmin University released iLLaDA, an 8B diffusion language model that matches Qwen2.5 7B on base benchmarks but trails by 10 points after instruction tuning. The model, trained from scratch on 12 trillion tokens, represents a bet that diffusion can compete with autoregressive generation on quality, not just speed.

Key facts

iLLaDA trained on 12 trillion tokens, up from 2.3T for LLaDA.
iLLaDA-Base averages 63.9 vs Qwen2.5 7B's 63.3.
iLLaDA-Instruct scores 67.1 vs Qwen2.5 7B Instruct's 77.1.
BBH reasoning score jumped 21.6 points over LLaDA.
Google's DiffusionGemma trades quality for 4x speed.

The Diffusion Alternative to Autoregressive Generation

Nearly all commercial LLMs, including GPT, Claude, and Qwen, generate text autoregressively: left to right, one token at a time. Diffusion language models like iLLaDA instead start with a sequence of masked tokens and refine them in parallel across multiple passes, similar to how image diffusion models denoise an image starting from random noise. Because no left-to-right order is imposed, these models use bidirectional attention, so every token position can attend to every other position simultaneously.
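To make the masked-refinement idea concrete, here is a minimal toy sketch in Python. It is not iLLaDA's actual decoding procedure: the `toy_model` function, the random confidence scores, and the "commit the most confident positions each pass" schedule are all assumptions made for illustration. A real diffusion LM would compute the proposals with a bidirectional transformer.

```python
import random

MASK = "<mask>"

def toy_model(sequence, target, rng):
    """Hypothetical stand-in for a denoiser (an assumption, not a real model).

    For every still-masked position it proposes a token and a confidence
    score. Here the token is read from a known target and the confidence
    is random; a real model would predict both from the whole sequence.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(sequence)
        if tok == MASK
    }

def diffusion_decode(target, num_steps=4, seed=0):
    """Start fully masked and commit a batch of positions on each pass."""
    rng = random.Random(seed)
    sequence = [MASK] * len(target)
    history = [list(sequence)]
    for step in range(num_steps):
        proposals = toy_model(sequence, target, rng)
        if not proposals:
            break
        remaining_steps = num_steps - step
        # Ceiling division: spread the masked positions over the passes left.
        k = -(-len(proposals) // remaining_steps)
        # Commit the k most confident proposals, in any order of position.
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            sequence[i] = proposals[i][0]
        history.append(list(sequence))
    return sequence, history

if __name__ == "__main__":
    words = "diffusion models fill in masked tokens in parallel".split()
    final, history = diffusion_decode(words, num_steps=4)
    for n, state in enumerate(history):
        print(n, " ".join(state))
```

Each printed row shows the sequence after one pass. Unlike left-to-right generation, positions are filled in whatever order the (here random) confidence scores dictate, and several positions are filled per pass.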

According to The Decoder, iLLaDA is part of a broader movement that includes Google's DiffusionGemma, released in June 2026. DiffusionGemma, built on the 25B-parameter Gemma 4 MoE backbone, generates text about four times faster via diffusion but scores worse on MMLU and code benchmarks. Google recommends it for low-latency use cases, not quality-critical production. iLLaDA takes the opposite approach: a dense 8B model trained from scratch, prioritizing quality over speed.

Benchmark Results: Base Parity, Instruct Gap

The iLLaDA team pretrained the model on 12 trillion tokens—up from 2.3 trillion for its predecessor LLaDA—and fine-tuned for twelve epochs. According to the paper, iLLaDA-Base improves sharply over LLaDA, jumping 21.6 points on the reasoning test BBH. On average it hits 63.9 points, edging past the autoregressive Qwen2.5 7B at 63.3.

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream wasn't trained from scratch but fine-tuned from an existing Qwen2.5 checkpoint. iLLaDA still beats Dream on average, 63.9 vs. 61.4, even without the head start of a strong autoregressive base. Dream only holds a slight edge on coding benchmarks.

A gap remains at the instruct level. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct hits 77.1, with math and code driving most of the difference. The authors blame this on the extra reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on harder tasks.
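The gaps quoted above follow directly from the reported averages. The short Python sketch below recomputes them. The scores are the ones given in this article, while the dictionary keys and the `gap` helper are names chosen here for illustration only.

```python
# Average benchmark scores as reported in the article.
scores = {
    "iLLaDA-Base": 63.9,
    "Qwen2.5 7B": 63.3,
    "Dream 7B": 61.4,
    "iLLaDA-Instruct": 67.1,
    "Qwen2.5 7B Instruct": 77.1,
}

def gap(model_a, model_b):
    """Score difference in points (positive means model_a is ahead)."""
    return round(scores[model_a] - scores[model_b], 1)

base_gap = gap("iLLaDA-Base", "Qwen2.5 7B")
dream_gap = gap("iLLaDA-Base", "Dream 7B")
instruct_gap = gap("iLLaDA-Instruct", "Qwen2.5 7B Instruct")

print(f"Base vs Qwen2.5 7B:        {base_gap:+.1f}")
print(f"Base vs Dream 7B:          {dream_gap:+.1f}")
print(f"Instruct vs Qwen Instruct: {instruct_gap:+.1f}")
```

The base-level lead over Qwen2.5 7B is under one point, so "parity" is the fair reading, while the instruct deficit is more than an order of magnitude larger.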

What This Means for the Diffusion LLM Race

ByteDance's iLLaDA demonstrates that a diffusion model trained from scratch can match autoregressive models at the base level—a non-trivial result given that prior diffusion LMs like Dream relied on autoregressive checkpoints for initialization. The 10-point instruct gap, however, highlights the importance of RL-based alignment, which diffusion models have not yet mastered. Google's DiffusionGemma, at a larger 25B parameter count, similarly trades quality for speed, suggesting that diffusion LMs are currently best suited for latency-sensitive applications rather than quality-critical production.

ByteDance has been investing heavily in AI infrastructure. As previously reported, the company purchased tens of thousands of Iluvatar CoreX AI processors for cloud infrastructure in June 2026, signaling its intent to scale AI workloads domestically despite US export controls.

What to watch

Watch for ByteDance to release an iLLaDA variant with RL-based alignment, which could close the instruct gap. Also track whether Google scales DiffusionGemma beyond low-latency niches—if diffusion LMs match autoregressive quality within 12 months, the LLM architecture landscape shifts.

Sources cited in this article

The Decoder

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

iLLaDA's base-level parity with Qwen2.5 is significant because it shows diffusion LMs can match autoregressive models when trained from scratch, not just fine-tuned from existing checkpoints. The 10-point instruct gap, however, underscores a structural limitation: diffusion models lack the RL-based alignment that has become standard for autoregressive models. This mirrors Google's experience with DiffusionGemma, which trades quality for speed.

The key question is whether diffusion LMs can close this gap with better alignment techniques, or whether the bidirectional generation process inherently limits instruction-following. ByteDance's investment in domestic AI chips suggests they are betting on diffusion as a long-term architecture, but the instruct gap may limit near-term adoption to latency-sensitive use cases like real-time translation or streaming text.

Compared to the autoregressive arms race, diffusion LMs represent a contrarian bet: that parallel generation can eventually match or exceed sequential generation with sufficient scale and alignment. iLLaDA provides the best evidence yet that this bet has merit, but the instruct gap remains a critical hurdle.
