Technique · architecture
Transformer Self-Attention
A sequence-to-sequence architecture that replaces recurrence with scaled dot-product attention, enabling parallel training and long-range context modeling.
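For reference, a minimal NumPy sketch of the scaled dot-product attention the description refers to, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. The function name, toy shapes, and random inputs are illustrative assumptions, not taken from any model listed below.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    # Query-key similarity scores, scaled by sqrt(d_k) so the
    # softmax stays well-conditioned as dimensionality grows.
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    # Row-wise softmax: each query gets a distribution over all keys,
    # which is what lets every position attend to long-range context.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: attention-weighted mixture of value vectors.
    return weights @ V

# Illustrative usage: 4 tokens, head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```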
Deployment timeline
- Llama 4 Scout · high
Deployed 2025-04-05 · Velocity 8y
“All Llama models are Transformer-based; Llama 4 Scout is described as a multimodal MoE model.”
- Llama 4 Maverick · high
Deployed 2025-04-05 · Velocity 8y
“Llama 4 is a Transformer-based LLM; the core architecture is self-attention.”
- Claude Opus 4.6 · high
Deployed 2026-02-16 · Velocity 9y
“Claude is based on transformer architecture with self-attention mechanisms.”
- GPT-4o · high
Deployed 2026-02-16 · Velocity 9y
“GPT-4o is a Transformer-based model, the core architecture of all GPT models.”
- GPT-5 · high
Deployed 2026-02-16 · Velocity 9y
“GPT-5 is a Generative Pre-trained Transformer, fundamentally based on the Transformer architecture.”
- GPT-5.2 Pro · high
Deployed 2026-02-17 · Velocity 9y
“GPT-5.2 is a direct successor in the GPT series, which is fundamentally based on the Transformer architecture.”
- Claude 3 · high
Deployed 2026-02-18 · Velocity 9y
“Claude 3 is built on a Transformer architecture with self-attention, as stated in its technical report.”
- Gemini 3.1 · high
Deployed 2026-02-20 · Velocity 9y
“Gemini is a Transformer-based model, using self-attention as its core architecture.”
- Claude 3.5 Sonnet · high
Deployed 2026-02-23 · Velocity 9y
“Claude 3.5 Sonnet is built on transformer architecture with self-attention mechanisms.”
- Claude Sonnet 4.6 · high
Deployed 2026-02-25 · Velocity 9y
“Claude Sonnet is a Transformer-based large language model.”
- Claude Haiku 4.5 · high
Deployed 2026-02-25 · Velocity 9y
“Claude models are transformer-based language models.”
- GPT-5.3 · high
Deployed 2026-02-26 · Velocity 9y
“GPT-5.3 is a Transformer-based model, using self-attention as its core architecture.”
- Claude 4.5 · high
Deployed 2026-02-26 · Velocity 9y
“Claude is built on transformer architecture with self-attention mechanisms.”
- Gemini 3 Flash · high
Deployed 2026-02-27 · Velocity 9y
“Gemini models are based on the Transformer architecture, using decoder-only models with self-attention.”
- GPT-OSS-120B · high
Deployed 2026-03-02 · Velocity 9y
“GPT-OSS-120B is a 120-billion parameter model, which fundamentally relies on the transformer self-attention architecture.”
- Grok 4.20 · high
Deployed 2026-03-02 · Velocity 9y
“Grok is a large language model built on the Transformer architecture.”
- Kimi K2.5 · high
Deployed 2026-03-04 · Velocity 9y
“Kimi K2.5 is fundamentally a Transformer-based model, using self-attention as its core architecture.”
- Gemini 3.1 Flash-Lite · high
Deployed 2026-03-05 · Velocity 9y
“Gemini models are based on the Transformer decoder architecture.”
- Nemotron 3 Super · high
Deployed 2026-03-11 · Velocity 9y
“Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture.”
- DeepSeek-R1 · high
Deployed 2026-03-17 · Velocity 9y
“Based on transformer architecture with self-attention mechanisms.”
- Claude 3.5 Opus · high
Deployed 2026-03-18 · Velocity 9y
“Claude models are transformer-based; Opus uses standard transformer architecture.”
- GPT-5.4-Cyber · high
Deployed 2026-04-16 · Velocity 9y
“GPT models are based on the Transformer architecture with self-attention.”
- Claude Opus 4.7 · high
Deployed 2026-04-16 · Velocity 9y
“All Claude models are based on the Transformer architecture, which is foundational and explicitly stated in their technical documentation.”