Technique · architecture
Transformer Self-Attention
A sequence-to-sequence architecture that replaces recurrence with scaled dot-product attention, enabling parallel training and long-range context modeling.
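For reference, a minimal NumPy sketch of the scaled dot-product attention the description refers to, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The function name, toy shapes, and random inputs are illustrative assumptions, not taken from any model listed below.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    # Query-key similarity scores, scaled by sqrt(d_k) so the
    # softmax stays well-conditioned as dimensionality grows.
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # Row-wise softmax: each query gets a distribution over all keys,
    # which is what lets every position attend to long-range context.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: attention-weighted mixture of value vectors.
    return weights @ V

# Illustrative usage: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```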
Deployment timeline
- Llama 4 Scout · high
Deployed 2025-04-05 · Velocity 8y
“All Llama models are Transformer-based; Llama 4 Scout is described as a multimodal MoE model.”
- Llama 4 Maverick · high
Deployed 2025-04-05 · Velocity 8y
“Llama 4 is a Transformer-based LLM; the core architecture is self-attention.”
- Claude Opus 4.6 · high
Deployed 2026-02-16 · Velocity 9y
“Claude is based on transformer architecture with self-attention mechanisms.”
- GPT-4o · high
Deployed 2026-02-16 · Velocity 9y
“GPT-4o is a Transformer-based model, the core architecture of all GPT models.”
- GPT-5 · high
Deployed 2026-02-16 · Velocity 9y
“GPT-5 is a Generative Pre-trained Transformer, fundamentally based on the Transformer architecture.”
- GPT-5.2 Pro · high
Deployed 2026-02-17 · Velocity 9y
“GPT-5.2 is a direct successor in the GPT series, which is fundamentally based on the Transformer architecture.”
- Claude 3 · high
Deployed 2026-02-18 · Velocity 9y
“Claude 3 is built on a Transformer architecture with self-attention, as stated in its technical report.”
- Gemini 3.1 · high
Deployed 2026-02-20 · Velocity 9y
“Gemini is a Transformer-based model, using self-attention as its core architecture.”
- Claude 3.5 Sonnet · high
Deployed 2026-02-23 · Velocity 9y
“Claude 3.5 Sonnet is built on transformer architecture with self-attention mechanisms.”
- Claude Sonnet 4.6 · high
Deployed 2026-02-25 · Velocity 9y
“Claude Sonnet is a Transformer-based large language model.”
- Claude Haiku 4.5 · high
Deployed 2026-02-25 · Velocity 9y
“Claude models are transformer-based language models.”
- GPT-5.3 · high
Deployed 2026-02-26 · Velocity 9y
“GPT-5.3 is a Transformer-based model, using self-attention as its core architecture.”
- Claude 4.5 · high
Deployed 2026-02-26 · Velocity 9y
“Claude is built on transformer architecture with self-attention mechanisms.”
- Gemini 3 Flash · high
Deployed 2026-02-27 · Velocity 9y
“Gemini models are based on the Transformer architecture, using decoder-only models with self-attention.”
- GPT-OSS-120B · high
Deployed 2026-03-02 · Velocity 9y
“GPT-OSS-120B is a 120-billion parameter model, which fundamentally relies on the transformer self-attention architecture.”
- Grok 4.20 · high
Deployed 2026-03-02 · Velocity 9y
“Grok is a large language model built on the Transformer architecture.”
- Kimi K2.5 · high
Deployed 2026-03-04 · Velocity 9y
“Kimi K2.5 is fundamentally a Transformer-based model, using self-attention as its core architecture.”
- Gemini 3.1 Flash-Lite · high
Deployed 2026-03-05 · Velocity 9y
“Gemini models are based on the Transformer decoder architecture.”
- Nemotron 3 Super · high
Deployed 2026-03-11 · Velocity 9y
“Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture.”
- DeepSeek-R1 · high
Deployed 2026-03-17 · Velocity 9y
“Based on transformer architecture with self-attention mechanisms.”
- Claude 3.5 Opus · high
Deployed 2026-03-18 · Velocity 9y
“Claude models are transformer-based; Opus uses standard transformer architecture.”
- GPT-5.4-Cyber · high
Deployed 2026-04-16 · Velocity 9y
“GPT models are based on the Transformer architecture with self-attention.”
- Claude Opus 4.7 · high
Deployed 2026-04-16 · Velocity 9y
“All Claude models are based on the Transformer architecture, which is foundational and explicitly stated in their technical documentation.”