Diffusion models are a class of generative models that learn to produce data by reversing a Markov chain of diffusion steps. They were inspired by non-equilibrium thermodynamics and first formalized in a 2015 paper by Sohl-Dickstein et al., but gained widespread adoption after Ho et al. (2020) introduced Denoising Diffusion Probabilistic Models (DDPMs), showing they could rival GANs in image quality; Dhariwal & Nichol (2021) later demonstrated diffusion models beating GANs outright on ImageNet synthesis.
How they work: The core idea has two phases: (1) a forward process that gradually adds Gaussian noise to a clean data sample over a fixed number of timesteps (e.g., 1000 steps) until it becomes pure noise, and (2) a reverse process in which a neural network (typically a U-Net with attention layers) is trained to predict the noise added at each step, thereby denoising the sample back toward the original data distribution. The model is trained by minimizing a simple mean-squared error between predicted and actual noise. At inference, a random noise vector is iteratively denoised through the reverse steps to generate a new sample. Later improvements such as Denoising Diffusion Implicit Models (DDIMs) reduced the number of required steps (from 1000 to 10–50) by using non-Markovian sampling, and latent diffusion models (Rombach et al., 2022) moved the process into a compressed latent space (via a VAE), dramatically reducing computational cost while maintaining quality; this is the foundation of Stable Diffusion.
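To make the two phases concrete, here is a minimal, self-contained PyTorch sketch of the epsilon-prediction training loss and the ancestral sampling loop described above. The tiny MLP denoiser, the 2-D toy data, and the crude timestep embedding are illustrative assumptions; a real DDPM uses a U-Net with attention and a learned timestep embedding.

```python
import torch
import torch.nn.functional as F

T = 1000                                     # number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule (DDPM)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# Placeholder denoiser eps_theta(x_t, t); a real model is a U-Net with attention.
model = torch.nn.Sequential(
    torch.nn.Linear(2 + 1, 128), torch.nn.SiLU(), torch.nn.Linear(128, 2)
)

def training_loss(x0):
    """Forward process + epsilon-prediction MSE for one minibatch of 2-D points."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # add noise in closed form
    t_emb = (t.float() / T).unsqueeze(-1)                     # crude timestep conditioning
    pred_noise = model(torch.cat([x_t, t_emb], dim=-1))
    return F.mse_loss(pred_noise, noise)

@torch.no_grad()
def sample(n):
    """Ancestral sampling: start from pure noise and apply the reverse steps."""
    x = torch.randn(n, 2)
    for t in reversed(range(T)):
        t_emb = torch.full((n, 1), t / T)
        eps = model(torch.cat([x, t_emb], dim=-1))
        x = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                              # add noise except at the final step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

loss = training_loss(torch.randn(64, 2))       # e.g. one batch of toy data
loss.backward()
```

DDIM-style samplers reuse the same trained noise predictor but take larger, deterministic steps, which is how the step count drops from 1000 to 10–50.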
Why they matter: Diffusion models set new standards for image generation quality (e.g., FID scores <10 on ImageNet 256x256), sample diversity, and mode coverage. They avoid the training instability of GANs and are far less prone to mode collapse. Their ability to condition on text (via cross-attention) led to breakthroughs like DALL·E 2, Imagen, and Stable Diffusion, which power tools with hundreds of millions of users. As of 2026, diffusion models are also dominant in video generation (e.g., OpenAI Sora, Runway Gen-3 Alpha), 3D content creation (DreamFusion, Zero-1-to-3), and audio generation (Stable Audio, AudioLDM).
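As a rough illustration of the cross-attention conditioning mentioned above, the sketch below shows flattened image features attending to text-encoder embeddings inside one block. The dimensions (320 image channels, 768 text channels, 77-token prompts) mirror common latent-diffusion configurations but are assumptions here, as is the use of PyTorch's built-in MultiheadAttention in place of a hand-rolled attention layer.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Sketch of text conditioning in a diffusion U-Net: spatial features act as
    queries, text embeddings (e.g. from CLIP or T5) act as keys and values."""
    def __init__(self, img_dim=320, txt_dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(img_dim)
        self.attn = nn.MultiheadAttention(img_dim, heads, kdim=txt_dim,
                                          vdim=txt_dim, batch_first=True)

    def forward(self, img_tokens, text_emb):
        # img_tokens: (batch, H*W, img_dim) flattened spatial features
        # text_emb:   (batch, seq_len, txt_dim) prompt token embeddings
        attended, _ = self.attn(self.norm(img_tokens), text_emb, text_emb)
        return img_tokens + attended   # residual connection

block = CrossAttentionBlock()
out = block(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
```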
When used vs. alternatives: Diffusion models are preferred when high sample quality and diversity are paramount, especially for images, video, and audio. They are less suitable for real-time applications (e.g., interactive chatbots) due to slow iterative sampling, though distillation (e.g., Progressive Distillation, consistency models) has cut inference to 1–4 steps. GANs remain faster but are harder to train and more prone to collapse; VAEs are faster but produce blurrier outputs; autoregressive models excel at text but struggle with high-dimensional continuous data like images.
Common pitfalls: (1) Slow sampling: even with 10–50 steps, generation is slower than GANs or VAEs; (2) Sensitivity to hyperparameters (noise schedule, number of steps, learning rate); (3) Tendency to produce artifacts on out-of-distribution prompts and fine-grained structure (e.g., hands in text-to-image); (4) High memory usage due to large U-Net architectures and attention maps; (5) Difficulty in controlled generation without guidance (classifier-free guidance is standard but requires extra tuning of the guidance scale; see the sketch below).
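Because pitfall (5) comes up so often in practice, here is a minimal sketch of classifier-free guidance at sampling time. The model(x_t, t, emb) signature and the default scale of 7.5 are assumptions for illustration, not any specific library's API; the denoiser is queried twice per step and the two predictions are extrapolated by the guidance scale.

```python
import torch

def cfg_noise_prediction(model, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: combine conditional and unconditional predictions.
    model(x_t, t, emb) is a hypothetical denoiser signature used for illustration."""
    eps_uncond = model(x_t, t, uncond_emb)   # prediction with a null / empty prompt
    eps_cond = model(x_t, t, cond_emb)       # prediction with the actual prompt
    # Larger scales follow the prompt more closely but tend to oversaturate and
    # introduce artifacts; smaller scales drift toward unconditional samples.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```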
Current state of the art (2026): Diffusion models are the backbone of most commercial generative media platforms. Key milestones: Stable Diffusion 3.5 (2024) uses a rectified flow transformer (MMDiT) for improved text rendering and efficiency; OpenAI’s Sora (2024) scales diffusion transformers to video with spatiotemporal patches; Google’s Veo 2 (2025) produces 4K video with consistent long-range structure; consistency models (Song et al., 2023) and adversarial diffusion distillation (Sauer et al., 2023) enable real-time generation in 1–2 steps. Research is moving toward unified multimodal models that handle text, image, video, and audio jointly; related systems such as Meta’s CM3leon and Google’s Muse pursue the same goal with token-based transformers rather than diffusion.
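Since rectified flow is named above as the objective behind MMDiT-style models, here is a compact sketch of its training loss under the usual formulation: interpolate linearly between data and noise, then regress the constant velocity that moves samples along that straight path. The velocity_model signature and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(velocity_model, x0):
    """Rectified-flow / flow-matching training step (velocity_model(x_t, t) is a
    hypothetical signature): regress the straight-line velocity from data to noise."""
    noise = torch.randn_like(x0)
    # One random time per sample, broadcastable over the remaining dimensions.
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))
    x_t = (1.0 - t) * x0 + t * noise        # linear interpolation between data and noise
    target_velocity = noise - x0            # d x_t / d t along the straight path
    return F.mse_loss(velocity_model(x_t, t), target_velocity)
```

Sampling then integrates the learned velocity field from noise back to data with an ODE solver, which is one reason few-step generation is natural in this framework.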