Technique · inference
FlashAttention
A tiled, IO-aware attention kernel that computes exact attention with memory linear in sequence length: the softmax is fused into a single pass over key/value tiles held in on-chip SRAM, so the full N×N score matrix is never materialized in HBM.
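The tiling idea can be illustrated with a toy NumPy sketch (hypothetical, single head, no masking or dropout): stream K/V in blocks while maintaining a running row-wise max and softmax normalizer (the "online softmax" trick), so only one tile of scores exists at a time yet the result is exact.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Toy FlashAttention-style forward pass (illustrative sketch only).

    Streams K/V in tiles, keeping a running row-max `m` and normalizer
    `l` so the full N x N score matrix is never materialized, while the
    output matches standard softmax attention exactly.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # unnormalized output accumulator
    m = np.full(N, -np.inf)          # running row-wise max of scores
    l = np.zeros(N)                  # running softmax normalizer

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                   # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new)           # rescale old accumulators
        P = np.exp(S - m_new[:, None])           # tile's unnormalized probs
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]
```

The real kernels fuse these steps on-GPU so each K/V tile is read from HBM once; this sketch only demonstrates the numerics that make that fusion exact.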
Deployment timeline
- GPT-4o (medium)
Deployed 2026-02-16 · Velocity 4y
“OpenAI's Triton kernels, used for GPT-4, are predecessors to FlashAttention. GPT-4o's speed improvements suggest optimized attention.”
- GPT-5 (medium)
Deployed 2026-02-16 · Velocity 4y
“OpenAI's technical infrastructure for large models heavily utilizes optimized attention kernels like FlashAttention.”
- Gemini 3 Pro (high)
Deployed 2026-02-19 · Velocity 4y
“Gemini models use Flash-Decoding for efficient attention, a variant of FlashAttention.”
- Claude 3.5 Sonnet (medium)
Deployed 2026-02-23 · Velocity 4y
“Anthropic's research mentions using FlashAttention for efficient training of their transformer models.”
- GPT-5.3 (medium)
Deployed 2026-02-26 · Velocity 4y
“OpenAI's models since GPT-3 have utilized attention optimizations; FlashAttention is a standard for efficient large-scale attention.”
- Gemini 3 Flash (high)
Deployed 2026-02-27 · Velocity 4y
“Gemini models use FlashAttention-2 for efficient training and inference, as stated in the Gemini 1.5 technical report.”
- Kimi K2.5 (medium)
Deployed 2026-03-04 · Velocity 4y
“The model card mentions optimizations for efficient inference, which commonly includes FlashAttention for long-context handling.”
- DeepSeek-V3 (high)
Deployed 2026-03-11 · Velocity 4y
“DeepSeek-V3 uses FlashAttention-2 for efficient training.”
- Mistral Small 4 (high)
Deployed 2026-03-16 · Velocity 4y
“Mistral's inference stack supports FlashAttention.”
- GLM-5.1 (high)
Deployed 2026-03-21 · Velocity 4y
“GLM-5.1 implements FlashAttention-2 for efficient attention computation.”
- Qwen 3.6 (high)
Deployed 2026-03-31 · Velocity 4y
“Qwen models utilize FlashAttention for efficient training and inference.”
- GPT-5.4-Cyber (medium)
Deployed 2026-04-16 · Velocity 4y
“OpenAI's models are known to use optimized attention kernels for training and inference efficiency.”