Key Takeaways
- DeepMind’s new VAE produces 1024x1024 images with quality comparable to Stable Diffusion’s 256x256 output, potentially replacing the standard VAE in generative pipelines.
- This cuts the token count by 16x at any given resolution, enabling faster generation and lower memory usage.
What Happened

Google DeepMind has released a new Variational Autoencoder (VAE) that reconstructs images at 4x the linear resolution (16x the pixel count) of Stable Diffusion's standard VAE, while maintaining comparable visual quality. The model, trained on a 1.2-billion-image dataset, achieves a reconstruction FID of 1.28 on 1024x1024 images, matching Stable Diffusion's VAE at 256x256.
The VAE compresses images into a latent space with 16x fewer tokens than the standard 4x-downsampling VAE used in Stable Diffusion. A 256x256 image that previously required 4096 tokens now needs only 256, reducing memory and compute requirements by roughly 16x.
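The token arithmetic above can be checked in a few lines, assuming one token per spatial position in the latent grid (the standard convention in latent-diffusion setups):

```python
def latent_tokens(width: int, height: int, downsample: int) -> int:
    """Number of latent tokens for an image, assuming one token per
    spatial position in the downsampled latent grid."""
    return (width // downsample) * (height // downsample)

# Stable Diffusion's standard VAE: 4x spatial downsampling
sd_tokens = latent_tokens(256, 256, 4)    # 64x64 grid -> 4096 tokens

# DeepMind's VAE: 16x spatial downsampling
dm_tokens = latent_tokens(256, 256, 16)   # 16x16 grid -> 256 tokens

print(sd_tokens, dm_tokens, sd_tokens // dm_tokens)  # 4096 256 16
```

The same ratio holds at any resolution: the 16x token reduction comes purely from the more aggressive downsampling factor, not from the image size.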
Technical Details
- Resolution: 1024x1024 output (vs. 256x256 for SD’s VAE)
- Compression ratio: 16x per-axis spatial downsampling (vs. 4x)
- Token count: 256 tokens for a 256x256 image (vs. 4096 with SD's VAE); a 1024x1024 image needs 4096 tokens
- Dataset: 1.2 billion images
- Reconstruction FID: 1.28 at 1024x1024
The model uses a novel architecture that combines a convolutional encoder with a transformer-based decoder, enabling efficient high-resolution reconstruction. The encoder downsamples aggressively, and the decoder uses attention mechanisms to recover fine details.
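The layer configuration has not been published, so as an illustration only: one common way to reach a 16x factor is four stride-2 stages, and the decoder's sequence length then follows from the one-token-per-latent-position convention. A minimal shape-tracing sketch under those assumptions:

```python
def encoder_shapes(h: int, w: int, stages: int = 4):
    """Trace spatial shapes through `stages` stride-2 downsampling
    stages; 4 stages gives 16x overall downsampling. The stage count
    is an assumption for illustration, not from the report."""
    shapes = [(h, w)]
    for _ in range(stages):
        h, w = h // 2, w // 2
        shapes.append((h, w))
    return shapes

shapes = encoder_shapes(1024, 1024)
print(shapes)  # [(1024, 1024), (512, 512), (256, 256), (128, 128), (64, 64)]

# One token per latent position: the sequence length the
# transformer-based decoder attends over at 1024x1024 input.
seq_len = shapes[-1][0] * shapes[-1][1]
print(seq_len)  # 4096
```

Any convolutional stack with the same overall stride would produce the same latent grid; the sketch only shows why a 16x factor keeps the decoder's attention over a manageable 4096-token sequence even at 1024x1024.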
How It Compares
| | DeepMind VAE | Stable Diffusion VAE |
| --- | --- | --- |
| Output resolution | 1024x1024 | 256x256 |
| Tokens per 256x256 image | 256 | 4096 |
| Reconstruction FID | 1.28 | ~1.3 (estimated) |
| Spatial downsampling | 16x | 4x |

The quality is roughly equivalent at the pixel level, but the DeepMind VAE operates at 16x the pixel count — meaning generative models built on top of it can produce much larger images without additional upsampling.
What This Means in Practice

For practitioners, this VAE could replace the standard VAE in any Stable Diffusion-based pipeline, enabling roughly 16x faster generation at the same resolution or 4x higher linear resolution at a comparable token budget. It also reduces memory requirements for training and inference, making high-resolution generation more accessible on consumer GPUs.
Limitations
- The model is not yet open-source; only a technical report has been released.
- Reconstruction of very fine details (e.g., text, faces) may still lag behind the input image at native resolution.
- Compatibility with existing Stable Diffusion checkpoints and LoRAs is unconfirmed.
Agentic.news Analysis
This development directly addresses one of the biggest bottlenecks in diffusion-based image generation: the VAE's token budget. Stable Diffusion's standard VAE compresses 256x256 images into 64x64 latent maps, which is fine for generating 512x512 or 1024x1024 images after upsampling. But for truly high-resolution output (e.g., 4K), the token count becomes unwieldy.
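To make the 4K point concrete, here is the token count at 4K UHD (3840x2160) under each downsampling factor, again assuming one token per latent position:

```python
def tokens_at(width: int, height: int, downsample: int) -> int:
    """Latent token count, assuming one token per latent position."""
    return (width // downsample) * (height // downsample)

uhd = (3840, 2160)  # 4K UHD
print(tokens_at(*uhd, 4))   # SD-style 4x downsampling: 518400 tokens
print(tokens_at(*uhd, 16))  # 16x downsampling: 32400 tokens
```

Over half a million tokens per image is impractical for attention-based generators; the 16x factor brings 4K into a range comparable to what current models already handle.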
DeepMind’s approach — aggressive spatial compression with a transformer decoder — is a natural evolution of the VAE design space. We previously covered OpenAI’s DALL·E 3 VAE improvements, which similarly focused on reducing token counts, but DeepMind’s 16x compression ratio is the most aggressive we’ve seen in a production-quality model.
This also aligns with the broader trend toward token efficiency in generative models. Meta's CM3leon and Google's own Muse models have explored similar territory, but this VAE is the first to match SD's quality at 4x the resolution. If open-sourced, it could become the de facto VAE for the next generation of image generators.
Frequently Asked Questions
Is this VAE available for download?
Not yet. Only a technical report and sample images have been released. No weights or code are publicly available.
Can I use this with my existing Stable Diffusion model?
Probably not without retraining. The latent space dimensions and structure likely differ from the standard VAE, so SD checkpoints would need to be adapted.
How does this compare to other high-resolution VAEs?
It achieves a reconstruction FID of 1.28 at 1024x1024, which is roughly on par with the best existing VAEs (e.g., from SDXL or Kandinsky) but at 4x the compression ratio. This is a significant efficiency gain.
Will this replace Stable Diffusion’s VAE?
If open-sourced and shown to be compatible, it likely will. The 16x token reduction translates into compelling speed and memory improvements, and the quality is comparable.