How does DiffusionGemma generate text differently from standard LLMs?

Instead of predicting one token at a time autoregressively, DiffusionGemma starts with noise and iteratively denoises the entire output sequence in parallel, similar to image diffusion models.

Is DiffusionGemma as good as Gemma 4 in quality?

No — Google acknowledges output quality is lower and positions it as an experimental tool for developers, not a replacement for autoregressive models.

Google Open-Sources DiffusionGemma, 26B Model Hits 1K Tokens/Sec on H100

Google open-sourced DiffusionGemma, a 26B-parameter diffusion text model hitting 1,000 tokens/sec on H100 — 4x faster than autoregressive models, but with lower quality.

Ala SMITH & AI Research Desk · Jun 10, 2026 · 3 min read · AI-Generated

Source: simonwillison.net via simon_willison, the_decoder, @HuggingPapers, tomshardware

What is DiffusionGemma and how fast is it?

Google open-sourced DiffusionGemma, a 26-billion-parameter model that generates text via diffusion, hitting ~1,000 tokens per second on a single Nvidia H100 GPU — roughly 4x faster than comparable autoregressive models.

TL;DR

Google released DiffusionGemma under Apache 2.0 license. · 26B-parameter model generates text via diffusion, not autoregression. · Nvidia claims 1,000 tokens/sec on a single H100 GPU.

Google released DiffusionGemma on June 10, a 26B-parameter open-weight model that generates text via diffusion. Nvidia claims 1,000 tokens per second on a single H100 GPU — roughly 4x faster than autoregressive models like Gemma 4.

Key facts

26 billion total parameters, ~4 billion active per token (MoE).
1,000 tokens per second claimed on a single H100 GPU.
Apache 2.0 license — fully open-weight.
Available on Hugging Face: google/diffusiongemma-26B-A4B-it.
Nvidia hosts free inference on NIM cloud API.

Google released DiffusionGemma, a 26-billion-parameter model that generates text not token by token but through diffusion, similar to how image AI turns noise into a picture. According to The Decoder and Simon Willison's blog, the model is available on Hugging Face as google/diffusiongemma-26B-A4B-it under an Apache 2 license — a significant departure from Google's typically more restricted model releases.

How it works and why speed matters

DiffusionGemma eschews the standard autoregressive approach (predicting one token at a time) for a continuous diffusion process that iteratively denoises a latent representation of the entire output sequence. This parallel generation is what enables the speedup: Nvidia claims it hits about 1,000 tokens per second on a single H100 GPU, roughly four times faster than comparable autoregressive models. Simon Willison tested the model via Nvidia's NIM cloud API, reporting 2,409 tokens generated in 4.4 seconds. That is at least 500 tokens/second (about 547 by straight division), and the measurement includes overhead from Python tooling, so raw inference is likely faster.
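To make the sequential-versus-parallel distinction concrete, here is a toy sketch in Python. It is not DiffusionGemma's actual algorithm: the function names, the step count of 16, and the "denoiser" (which simply moves toward a known target) are illustrative assumptions, since the article does not describe the model's internals or how many steps it uses.

```python
import random


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder needs one sequential forward pass per token."""
    return num_tokens


def diffusion_passes(num_tokens: int, num_steps: int) -> int:
    """A diffusion decoder needs one pass per denoising step, whatever the length."""
    return num_steps


def toy_denoise(target: list[int], num_steps: int, seed: int = 0) -> list[float]:
    """Refine pure noise toward `target`, updating every position at each step.

    `target` stands in for what a trained denoiser would predict; a real
    model does not know the answer in advance.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    for step in range(num_steps):
        remaining = num_steps - step
        # Close 1/remaining of the gap at every position in parallel.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
    return x


if __name__ == "__main__":
    tokens = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical token ids
    refined = toy_denoise(tokens, num_steps=16)
    print([round(v) for v in refined])
    print(autoregressive_passes(256), "sequential passes vs", diffusion_passes(256, 16))
```

Note that the pass counts are not a wall-clock speedup: each denoising pass processes the whole sequence and typically costs more than a single autoregressive step, which is one reason a claimed speedup can be far smaller than the ratio of pass counts.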

This isn't Google's first diffusion-for-text experiment. In May 2025, Google briefly released an experimental Gemini Diffusion model; Willison recorded it running at 857 tokens/second at the time. That research has now returned as a fully open-weight Gemma model, suggesting Google is serious about making diffusion-based text generation a production-ready alternative.
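The throughput figures quoted in this section can be checked with simple arithmetic. The sketch below uses only numbers from the article; the autoregressive baseline of 250 tokens/second is not a reported measurement but an inference from Nvidia's "roughly 4x" claim, and the helper name is hypothetical.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average throughput over a whole request, including any client overhead."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Figures quoted in the article.
WILLISON_NIM = tokens_per_second(2409, 4.4)  # Willison's NIM test
NVIDIA_CLAIM = 1000.0                        # Nvidia's single-H100 claim
GEMINI_DIFFUSION = 857.0                     # Willison's earlier Gemini Diffusion reading

# Assumption: "roughly 4x faster" is taken as exactly 4x to back out a baseline.
IMPLIED_AR_BASELINE = NVIDIA_CLAIM / 4

if __name__ == "__main__":
    print(f"Willison via NIM: {WILLISON_NIM:.1f} tok/s")
    print(f"Share of Nvidia's claim: {WILLISON_NIM / NVIDIA_CLAIM:.0%}")
    print(f"Implied autoregressive baseline: {IMPLIED_AR_BASELINE:.0f} tok/s")
```

Because Willison's timing covers the full request through a hosted API, it is a lower bound on raw inference speed rather than a like-for-like check of Nvidia's number.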

Quality trade-off and positioning

Output quality is lower than that of autoregressive models such as Gemma 4, so Google is positioning DiffusionGemma as an experimental tool for developers for now. The model is a 26B-parameter Mixture of Experts (26B-A4B), meaning only ~4B parameters are active per token, a design choice that keeps inference cheap. Nvidia is currently hosting the model for free on its NIM cloud API, lowering the barrier for developers to experiment.
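A rough back-of-the-envelope calculation shows what the 26B-A4B split means in practice. This is a sketch under stated assumptions: the parameter counts are the article's rounded figures, 16-bit weights are assumed (the article does not say which precision is used), and activations and other runtime memory are ignored.

```python
TOTAL_PARAMS = 26e9    # "26B" from the model name (rounded)
ACTIVE_PARAMS = 4e9    # "A4B": roughly 4B active per token (rounded)
BYTES_PER_PARAM = 2    # assumption: 16-bit weights


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that take part in computing any one token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"Active per token: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
    print(f"Weights in memory: {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM):.0f} GB")
```

The design trade-off is that a Mixture of Experts saves compute, not memory: every expert must stay loaded even though only a small share of the weights is used for each token.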

Community reaction and context

Hacker News commenters noted the strategic significance: "Google keeps flexin'. It's surprising that Gemini isn't more competitive against Claude or OpenAI models for code and agentic use, because it's clear Google still has some of the best AI people in the business." The model's speed makes it particularly relevant for on-device and near-realtime use cases — a domain where Google has invested heavily, from Gemini Nano to TPU v6e deployments.

What to watch

Watch for results on standard benchmarks (MMLU, HellaSwag, HumanEval) as the community stress-tests DiffusionGemma against Gemma 4 and Llama 4. The key question is whether the quality gap narrows with fine-tuning or more diffusion steps. Also watch Nvidia's NIM usage metrics: if developer adoption spikes, it signals real demand for non-autoregressive architectures.

[Image: flat minimalist illustration of a white pelican with a large orange beak riding a red bicycle with black wheels, against a pale blue background. Source: simonwillison.net]

[Updated 11 Jun via tomshardware]

Separately, Google has booked Intel to package more than 3 million of its TPUs in 2028 after months of testing Intel's advanced EMIB packaging for HBM integration, according to Tom's Hardware. SK hynix is also testing Intel's EMIB packaging for HBM integration, signaling a deepening supply chain relationship between Google, Intel, and memory makers.

Sources cited in this article

The Decoder
Tom's Hardware

Source: gentic.news · Jun 10, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

DiffusionGemma represents a genuine architectural shift, not just another model release. The autoregressive paradigm has dominated LLMs since Vaswani et al. 2017, and while speculative decoding and quantization have improved throughput, they haven't changed the fundamentally sequential nature of generation. Diffusion breaks that constraint by generating the entire sequence in parallel, which is where the claimed 4x speedup comes from: it's not a compression trick or a hardware optimization, but a different mathematical approach.

The timing is strategic. Google has been investing heavily in on-device and near-realtime AI, from Gemini Nano for phones to TPU v6e deployments, and now has a model architecture that can generate responses in milliseconds rather than seconds. If DiffusionGemma's quality can be improved through fine-tuning or more diffusion steps, it could reshape expectations for latency-sensitive applications like voice assistants, real-time translation, and live coding completion.

The Apache 2 license is also notable. Google's Gemma models have been open-weight but with usage restrictions; this is a full open-source release. That suggests Google wants broad developer adoption to build the ecosystem, even if it means giving up control. Nvidia's free NIM hosting further lowers friction, a rare alignment of incentives between the two companies.
