gentic.news — AI News Intelligence Platform

Technique · architecture

Transformer Self-Attention

A sequence-to-sequence architecture that replaces recurrence with scaled dot-product attention, enabling parallel training and long-range context modeling.
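The attention mechanism described above reduces to a few lines of linear algebra. Below is a minimal NumPy sketch of scaled dot-product self-attention (the function name and toy shapes are illustrative, not taken from the source):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                   # weighted sum of value vectors

# Self-attention: the same sequence supplies queries, keys, and values.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))   # 3 positions, model width 4
out, weights = scaled_dot_product_attention(X, X, X)
```

Because every position attends to every other position in a single matrix product, the whole sequence is processed in parallel, which is what lets Transformers dispense with the step-by-step computation of recurrent networks.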

Origin: Google, 2017-06 · Also known as: Transformer, Self-Attention, Scaled Dot-Product Attention

Products deploying: 25 · Avg research → prod: 9 years · First commercial deploy: 8 years

Deployment timeline

  1. Llama 4 Scout

    Deployed 2025-04-05 · Velocity 8y · Confidence: high

    All Llama models are Transformer-based; Llama 4 Scout is described as a multimodal MoE model.

  2. Llama 4 Maverick

    Deployed 2025-04-05 · Velocity 8y · Confidence: high

    Llama 4 is a Transformer-based LLM; self-attention is its core architecture.

  3. Claude Opus 4.6

    Deployed 2026-02-16 · Velocity 9y · Confidence: high

    Claude is based on the Transformer architecture with self-attention mechanisms.

  4. GPT-4o

    Deployed 2026-02-16 · Velocity 9y · Confidence: high

    GPT-4o is built on the Transformer, the core architecture of all GPT models.

  5. GPT-5

    Deployed 2026-02-16 · Velocity 9y · Confidence: high

    GPT-5 is a Generative Pre-trained Transformer, fundamentally based on the Transformer architecture.

  6. GPT-5.2 Pro

    Deployed 2026-02-17 · Velocity 9y · Confidence: high

    GPT-5.2 is a direct successor in the GPT series, which is fundamentally based on the Transformer architecture.

  7. Claude 3

    Deployed 2026-02-18 · Velocity 9y · Confidence: high

    Claude 3 is built on a Transformer architecture with self-attention, as stated in its technical report.

  8. Gemini 3 Pro

    Deployed 2026-02-19 · Velocity 9y · Confidence: high

    Gemini is a Transformer-based decoder-only model.

  9. Gemini 3.1

    Deployed 2026-02-20 · Velocity 9y · Confidence: high

    Gemini is a Transformer-based model with self-attention as its core architecture.

  10. Claude 3.5 Sonnet

    Deployed 2026-02-23 · Velocity 9y · Confidence: high

    Claude 3.5 Sonnet is built on the Transformer architecture with self-attention mechanisms.

  11. Claude Sonnet 4.6

    Deployed 2026-02-25 · Velocity 9y · Confidence: high

    Claude Sonnet is a Transformer-based large language model.

  12. Claude Haiku 4.5

    Deployed 2026-02-25 · Velocity 9y · Confidence: high

    Claude models are Transformer-based language models.

  13. GPT-5.3

    Deployed 2026-02-26 · Velocity 9y · Confidence: high

    GPT-5.3 is a Transformer-based model with self-attention as its core architecture.

  14. Claude 4.5

    Deployed 2026-02-26 · Velocity 9y · Confidence: high

    Claude is built on the Transformer architecture with self-attention mechanisms.

  15. Gemini 3 Flash

    Deployed 2026-02-27 · Velocity 9y · Confidence: high

    Gemini models are decoder-only Transformers built on self-attention.

  16. GPT-OSS-120B

    Deployed 2026-03-02 · Velocity 9y · Confidence: high

    GPT-OSS-120B is a 120-billion-parameter model that relies on the Transformer self-attention architecture.

  17. Grok 4.20

    Deployed 2026-03-02 · Velocity 9y · Confidence: high

    Grok is a large language model built on the Transformer architecture.

  18. Kimi K2.5

    Deployed 2026-03-04 · Velocity 9y · Confidence: high

    Kimi K2.5 is a Transformer-based model with self-attention as its core architecture.

  19. Gemini 3.1 Flash-Lite

    Deployed 2026-03-05 · Velocity 9y · Confidence: high

    Gemini models are based on the Transformer decoder architecture.

  20. Nemotron 3 Super

    Deployed 2026-03-11 · Velocity 9y · Confidence: high

    Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture.

  21. DeepSeek-R1

    Deployed 2026-03-17 · Velocity 9y · Confidence: high

    DeepSeek-R1 is based on the Transformer architecture with self-attention mechanisms.

  22. Claude 3.5 Opus

    Deployed 2026-03-18 · Velocity 9y · Confidence: high

    Claude models are Transformer-based; Opus uses a standard Transformer architecture.

  23. Qwen 3.6

    Deployed 2026-03-31 · Velocity 9y · Confidence: high

    Qwen 3.6 is a Transformer-based large language model.

  24. GPT-5.4-Cyber

    Deployed 2026-04-16 · Velocity 9y · Confidence: high

    GPT models are based on the Transformer architecture with self-attention.

  25. Claude Opus 4.7

    Deployed 2026-04-16 · Velocity 9y · Confidence: high

    All Claude models are based on the Transformer architecture, as stated in their technical documentation.