Technique · inference
FlashAttention
A tiled, IO-aware attention kernel that computes exact attention with memory linear in sequence length: the softmax is fused into a single pass over key/value tiles held in on-chip SRAM, so the full N×N score matrix is never materialized in HBM.
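The tiling idea can be illustrated with a toy NumPy sketch (hypothetical, single head, no masking or dropout): stream K/V in blocks while maintaining a running row-wise max and softmax normalizer (the "online softmax" trick), so only one tile of scores exists at a time yet the result is exact.

```python
import numpy as np

def flash_attention(Q, K, V, block_size=64):
    """Toy FlashAttention-style forward pass (illustrative sketch only).

    Streams K/V in tiles, keeping a running row-max `m` and normalizer
    `l` so the full N x N score matrix is never materialized, while the
    output matches standard softmax attention exactly.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)             # unnormalized output accumulator
    m = np.full(N, -np.inf)          # running row-wise max of scores
    l = np.zeros(N)                  # running softmax normalizer

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                   # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new)           # rescale old accumulators
        P = np.exp(S - m_new[:, None])           # tile's unnormalized probs
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]
```

The real kernels fuse these steps on-GPU so each K/V tile is read from HBM once; this sketch only demonstrates the numerics that make that fusion exact.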
Deployment timeline
- GPT-4o (medium)
Deployed 2026-02-16 · Velocity 4y
“OpenAI's Triton kernels, used for GPT-4, are predecessors to FlashAttention. GPT-4o's speed improvements suggest optimized attention.”
- GPT-5 (medium)
Deployed 2026-02-16 · Velocity 4y
“OpenAI's technical infrastructure for large models heavily utilizes optimized attention kernels like FlashAttention.”
- Gemini 3 Pro (high)
Deployed 2026-02-19 · Velocity 4y
“Gemini models use Flash-Decoding for efficient attention, a variant of FlashAttention.”
- Claude 3.5 Sonnet (medium)
Deployed 2026-02-23 · Velocity 4y
“Anthropic's research mentions using FlashAttention for efficient training of their transformer models.”
- GPT-5.3 (medium)
Deployed 2026-02-26 · Velocity 4y
“OpenAI's models since GPT-3 have utilized attention optimizations; FlashAttention is a standard for efficient large-scale attention.”
- Gemini 3 Flash (high)
Deployed 2026-02-27 · Velocity 4y
“Gemini models use FlashAttention-2 for efficient training and inference, as stated in the Gemini 1.5 technical report.”
- Kimi K2.5 (medium)
Deployed 2026-03-04 · Velocity 4y
“The model card mentions optimizations for efficient inference, which commonly includes FlashAttention for long-context handling.”
- DeepSeek-V3 (high)
Deployed 2026-03-11 · Velocity 4y
“DeepSeek-V3 uses FlashAttention-2 for efficient training.”
- Mistral Small 4 (high)
Deployed 2026-03-16 · Velocity 4y
“Mistral's inference stack supports FlashAttention.”
- GLM-5.1 (high)
Deployed 2026-03-21 · Velocity 4y
“GLM-5.1 implements FlashAttention-2 for efficient attention computation.”
- Qwen 3.6 (high)
Deployed 2026-03-31 · Velocity 4y
“Qwen models utilize FlashAttention for efficient training and inference.”
- GPT-5.4-Cyber (medium)
Deployed 2026-04-16 · Velocity 4y
“OpenAI's models are known to use optimized attention kernels for training and inference efficiency.”