gentic.news — AI News Intelligence Platform

Technique · inference

FlashAttention

A tiled, IO-aware attention kernel that computes exact attention with memory linear in sequence length, fusing the whole computation so intermediate tiles stay in fast on-chip SRAM instead of making round trips to HBM.

Origin: Stanford, 2022-05 · Read origin paper →
Also known as: FlashAttention-2, Flash
Products deploying: 12
Avg research → prod: 4y
First commercial deploy: 4y
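
For readers new to the technique, a minimal NumPy sketch of the core idea: exact attention computed one key/value tile at a time with an online softmax, so the full N×N score matrix is never materialized. The function name, tile size, and single-head layout are illustrative assumptions; the real kernel also tiles over queries and fuses everything into a single GPU pass.

```python
import numpy as np

def flash_attention(q, k, v, tile=128):
    """Exact softmax(q @ k.T / sqrt(d)) @ v with O(N * tile) extra memory."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)                # running weighted sum of values
    m = np.full(n, -np.inf)               # running row-wise max of scores
    l = np.zeros(n)                       # running softmax denominator
    for j in range(0, k.shape[0], tile):  # stream over key/value tiles
        kj, vj = k[j:j + tile], v[j:j + tile]
        s = (q @ kj.T) * scale            # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])    # tile-local softmax numerator
        c = np.exp(m - m_new)             # rescale earlier partial sums
        l = l * c + p.sum(axis=1)
        out = out * c[:, None] + p @ vj
        m = m_new
    return out / l[:, None]

# Sanity check against the naive quadratic-memory reference.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
w = np.exp((q @ k.T) / np.sqrt(64))
ref = (w / w.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention(q, k, v), ref, atol=1e-6)
```

The running max m and denominator l are the per-row softmax statistics the kernel carries between tiles; rescaling earlier partial sums by exp(m_old − m_new) is what keeps the result exact rather than approximate.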

Deployment timeline

  1. GPT-4o

    Deployed 2026-02-16 · Velocity 4y

    OpenAI's Triton fused-attention kernels, used for GPT-4, are closely related to FlashAttention; GPT-4o's speed improvements suggest similarly optimized attention.

    Confidence: medium
  2. GPT-5

    Deployed 2026-02-16 · Velocity 4y

    OpenAI's infrastructure for large models relies heavily on optimized attention kernels such as FlashAttention.

    Confidence: medium
  3. Gemini 3 Pro

    Deployed 2026-02-19 · Velocity 4y

    Gemini models use Flash-Decoding, a FlashAttention variant for the decode phase, for efficient attention; a sketch of the idea follows this timeline.

    Confidence: high
  4. Claude 3.5 Sonnet

    Deployed 2026-02-23 · Velocity 4y

    Anthropic's research mentions using FlashAttention for efficient training of their transformer models.

    Confidence: medium
  5. GPT-5.3

    Deployed 2026-02-26 · Velocity 4y

    OpenAI's models since GPT-3 have utilized attention optimizations; FlashAttention is a standard for efficient large-scale attention.

    Confidence: medium
  6. Gemini 3 Flash

    Deployed 2026-02-27 · Velocity 4y

    Gemini models use FlashAttention-2 for efficient training and inference, as stated in the Gemini 1.5 technical report.

    Confidence: high
  7. Kimi K2.5

    Deployed 2026-03-04 · Velocity 4y

    The model card mentions optimizations for efficient inference, which commonly include FlashAttention for long-context handling.

    Confidence: medium
  8. DeepSeek-V3

    Deployed 2026-03-11 · Velocity 4y

    DeepSeek-V3 uses FlashAttention-2 for efficient training.

    Confidence: high
  9. Mistral Small 4

    Deployed 2026-03-16 · Velocity 4y

    Mistral's inference stack supports FlashAttention; a usage sketch follows this timeline.

    Confidence: high
  10. GLM-5.1

    Deployed 2026-03-21 · Velocity 4y

    GLM-5.1 implements FlashAttention-2 for efficient attention computation.

    Confidence: high
  11. Qwen 3.6

    Deployed 2026-03-31 · Velocity 4y

    Qwen models utilize FlashAttention for efficient training and inference.

    Confidence: high
  12. GPT-5.4-Cyber

    Deployed 2026-04-16 · Velocity 4y

    OpenAI's models are known to use optimized attention kernels for training and inference efficiency.

    Confidence: medium
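
Flash-Decoding, mentioned for Gemini 3 Pro above, targets the decode phase, where a single query token must attend over a very long KV cache. A hedged NumPy sketch of the idea (names and chunk count are illustrative assumptions): the cache is split into chunks that are attended independently, in parallel on real hardware, and the partial results are merged exactly via their per-chunk max and denominator statistics.

```python
import numpy as np

def flash_decode(q, k_cache, v_cache, chunks=4):
    """q: (d,) query for one new token; exact attention over the full cache."""
    scale = 1.0 / np.sqrt(q.shape[0])
    partials = []
    for k_c, v_c in zip(np.array_split(k_cache, chunks),
                        np.array_split(v_cache, chunks)):
        s = (k_c @ q) * scale                   # scores for this chunk only
        m = s.max()                             # chunk-local max
        p = np.exp(s - m)
        partials.append((p @ v_c, m, p.sum()))  # (output, max, denominator)
    m_all = max(m for _, m, _ in partials)      # global max for stability
    num = sum(o * np.exp(m - m_all) for o, m, _ in partials)
    den = sum(d * np.exp(m - m_all) for _, m, d in partials)
    return num / den

# Matches the naive single-pass softmax over the whole cache.
rng = np.random.default_rng(1)
q = rng.standard_normal(64)
k_cache, v_cache = rng.standard_normal((2, 4096, 64))
w = np.exp(k_cache @ q / 8.0)
assert np.allclose(flash_decode(q, k_cache, v_cache), (w / w.sum()) @ v_cache)
```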
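
For the "inference stack supports FlashAttention" entries above, a minimal sketch of what that support typically looks like in user code, via PyTorch's public SDPA dispatcher (PyTorch 2.3+; shapes and dtype are illustrative, and flash-backend eligibility depends on the GPU, dtype, and head dimension):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Illustrative (batch, heads, seq_len, head_dim) half-precision tensors.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Restrict dispatch to the FlashAttention backend; this raises if the
# configuration is unsupported instead of silently falling back to the
# quadratic-memory math path.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```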