Google Open-Sources DiffusionGemma, 26B Model Hits 1K Tokens/Sec on H100

Google open-sourced DiffusionGemma, a 26B-parameter diffusion text model hitting 1,000 tokens/sec on H100 — 4x faster than autoregressive models, but with lower quality.

Source: simonwillison.net, via simon_willison, the_decoder, @HuggingPapers
What is DiffusionGemma and how fast is it?

Google open-sourced DiffusionGemma, a 26-billion-parameter model that generates text via diffusion, hitting ~1,000 tokens per second on a single Nvidia H100 GPU — roughly 4x faster than comparable autoregressive models.

TL;DR

  • Google released DiffusionGemma under the Apache 2.0 license.
  • The 26B-parameter model generates text via diffusion, not autoregression.
  • Nvidia claims 1,000 tokens/sec on a single H100 GPU.

On June 10, Google released DiffusionGemma, a 26B-parameter open-weight model that generates text via diffusion. Nvidia claims 1,000 tokens per second on a single H100 GPU, roughly 4x faster than autoregressive models like Gemma 4.

Key facts

  • 26 billion total parameters, ~4 billion active per token (MoE).
  • 1,000 tokens per second claimed on a single H100 GPU.
  • Apache 2.0 license — fully open-weight.
  • Available on Hugging Face: google/diffusiongemma-26B-A4B-it.
  • Nvidia hosts free inference on NIM cloud API.

DiffusionGemma generates text not token by token but through diffusion, similar to how image models turn noise into a picture. According to The Decoder and Simon Willison's blog, the model is available on Hugging Face as google/diffusiongemma-26B-A4B-it under the Apache 2.0 license, a significant departure from Google's typically more restrictive model releases.

How it works and why speed matters

DiffusionGemma replaces the standard autoregressive approach (predicting one token at a time) with a continuous diffusion process that iteratively denoises a latent representation of the entire output sequence. Because every position is refined in parallel, generation is faster: Nvidia claims about 1,000 tokens per second on a single H100 GPU, roughly four times faster than comparable autoregressive models. Simon Willison tested the model via Nvidia's NIM cloud API and reported 2,409 tokens generated in 4.4 seconds, an end-to-end rate of at least 500 tokens/second. That figure includes overhead from Python tooling, so raw inference is likely faster.
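The idea of refining a whole sequence at once can be illustrated with a toy loop. This is only a sketch of iterative parallel refinement, not DiffusionGemma's actual algorithm: the function name is hypothetical, and the `target` argument stands in for the clean latent that a trained denoiser would have to predict. The point it shows is that the number of sequential passes depends on the step count, not on the sequence length.

```python
import random


def toy_parallel_denoise(target, steps, seed=0):
    """Toy stand-in for diffusion decoding: refine all positions together.

    `target` plays the role of the clean latent that a trained denoiser
    would predict. A real model does not have access to it (assumption
    made purely for illustration).
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    passes = 0
    for step in range(steps):
        # Fraction of the remaining distance to close on this step;
        # it reaches 1.0 on the final step.
        blend = 1.0 / (steps - step)
        # Every position is updated in the same pass (the parallel part).
        latent = [x + blend * (t - x) for x, t in zip(latent, target)]
        passes += 1
    return latent, passes


short_latent, short_passes = toy_parallel_denoise([0.5] * 8, steps=4)
long_latent, long_passes = toy_parallel_denoise([0.5] * 512, steps=4)
print(short_passes, long_passes)  # prints "4 4"
```

An autoregressive decoder would need 8 and 512 sequential passes for the same two lengths, which is the intuition behind the speedup described above.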

This isn't Google's first diffusion-for-text experiment. Last May, Google briefly released an experimental Gemini Diffusion model; Willison recorded it running at 857 tokens/second at the time. That research has now returned as a fully open-weight Gemma model, suggesting Google is serious about making diffusion-based text generation a production-ready alternative.
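The throughput figures quoted in this section can be sanity-checked with simple arithmetic. The sketch below uses only numbers from the article (2,409 tokens in 4.4 seconds, the 857 tokens/second recorded for Gemini Diffusion, and Nvidia's claimed 1,000 tokens/second at roughly 4x). The function name and the derived autoregressive baseline are illustrative assumptions, not reported measurements.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average end-to-end throughput for a single request."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# Willison's NIM measurement, as quoted in the article.
nim_measured = tokens_per_second(2409, 4.4)

# Figures quoted directly in the article.
gemini_diffusion = 857.0
nvidia_claim = 1000.0

# Derived, not reported: the baseline implied by "roughly 4x faster".
implied_baseline = nvidia_claim / 4

for label, rate in [
    ("NIM API, end to end", nim_measured),
    ("Gemini Diffusion (last May)", gemini_diffusion),
    ("Nvidia claim, single H100", nvidia_claim),
    ("implied autoregressive baseline", implied_baseline),
]:
    print(f"{label}: {rate:.1f} tokens/sec")
```

The end-to-end measurement works out to about 547.5 tokens/second, consistent with Willison's "at least 500" and below Nvidia's claimed peak, as expected once tooling overhead is included.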

Quality trade-off and positioning

Output quality is lower than that of comparable autoregressive models, so Google is positioning DiffusionGemma as an experimental tool for developers for now. The model is a 26B-parameter Mixture of Experts (26B-A4B), meaning only ~4B parameters are active per token, a design choice that keeps inference cheap. Nvidia is currently hosting the model for free on its NIM cloud API, lowering the barrier for developers to experiment.
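To make the 26B-A4B naming concrete, here is a minimal calculation of the share of parameters used per token. It assumes the rounded figures from the article (26B total, about 4B active); the exact parameter counts are not given here, so the result is approximate.

```python
TOTAL_PARAMS_B = 26.0   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 4.0   # active per token, in billions (approximate, "~4B")

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"active per token: {active_fraction:.1%}")  # prints "active per token: 15.4%"
```

Roughly one parameter in six or seven participates in each token's computation, which is why per-token compute is closer to that of a 4B dense model even though all 26B weights must be stored.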

Community reaction and context

Hacker News commenters noted the strategic significance: "Google keeps flexin'. It's surprising that Gemini isn't more competitive against Claude or OpenAI models for code and agentic use, because it's clear Google still has some of the best AI people in the business." The model's speed makes it particularly relevant for on-device and near-realtime use cases — a domain where Google has invested heavily, from Gemini Nano to TPU v6e deployments.

What to watch

Watch for benchmark results on standard NLP tasks (MMLU, HellaSwag, HumanEval) as the community stress-tests DiffusionGemma against Gemma 4 and Llama 4. The key question is whether the quality gap narrows with fine-tuning or more diffusion steps. Also watch Nvidia's NIM usage metrics: a spike in developer adoption would signal real demand for non-autoregressive architectures.

Sources cited in this article

  1. The Decoder

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

DiffusionGemma represents a genuine architectural shift, not just another model release. The autoregressive paradigm has dominated LLMs since Vaswani et al. 2017, and while speculative decoding and quantization have improved throughput, they haven't changed the fundamentally sequential nature of generation. Diffusion breaks that constraint by generating the entire sequence in parallel. That is the basis for the claimed 4x speedup: it comes from a different mathematical approach, not from a compression trick or a hardware optimization.

The timing is strategic. Google has been investing heavily in on-device and near-realtime AI: Gemini Nano for phones, TPU v6e for edge, and now a model architecture that can generate responses in milliseconds rather than seconds. If DiffusionGemma's quality can be improved through fine-tuning or more diffusion steps, it could reshape expectations for latency-sensitive applications like voice assistants, real-time translation, and live coding completion.

The Apache 2.0 license is also notable. Google's Gemma models have been open-weight but with usage restrictions; this is a full open-source release. That suggests Google wants broad developer adoption to build the ecosystem, even if it means giving up control. Nvidia's free NIM hosting further lowers friction, a rare alignment of incentives between the two companies.