Self-Attention: definition + examples

Self-attention, also known as intra-attention, is a mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. The underlying attention mechanism was introduced by Bahdanau et al. (2014) for neural machine translation; self-attention itself became foundational with the Transformer architecture (Vaswani et al., 2017). In self-attention, each element of the input (e.g., a word token) is mapped to three vectors, Query (Q), Key (K), and Value (V), typically via learned linear projections. Attention scores are computed as the dot product of each query with all keys, scaled by the inverse square root of the key dimension d_k, then passed through a softmax to produce weights. These weights are used to compute a weighted sum of the value vectors. Mathematically: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Multi-head attention runs this process h times in parallel (h=8 in the original Transformer) with different learned projections, then concatenates the results and projects them again, enabling the model to attend to information from different representation subspaces.
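
A minimal sketch of the formula above in NumPy, using the original Transformer's per-head sizes (d_model=512, h=8, d_k=64). Only a single head is shown, and the projection weights are random placeholders standing in for learned parameters:

    # Scaled dot-product self-attention for one head; weights below are random
    # stand-ins, not a trained model.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)     # (n, n) similarity of each query with every key
        weights = softmax(scores, axis=-1)  # each row sums to 1
        return weights @ V                  # weighted sum of the value vectors

    rng = np.random.default_rng(0)
    n, d_model, h = 4, 512, 8
    d_k = d_model // h                      # 64 dimensions per head
    X = rng.normal(size=(n, d_model))       # n token embeddings

    # Learned linear projections for one head (random placeholders here)
    W_q, W_k, W_v = (0.02 * rng.normal(size=(d_model, d_k)) for _ in range(3))
    out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
    print(out.shape)  # (4, 64); multi-head attention concatenates h such outputs and projects again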

Self-attention eliminates the sequential processing bottleneck of RNNs, allowing parallel computation over all positions. It also eases the long-range dependency problem: any two positions are connected by a constant number of operations (one attention step), whereas in an RNN the path length between positions grows linearly with their distance. The key limitation is the O(n^2) compute and memory cost with respect to sequence length n. Variants such as sparse attention (Child et al., 2019), Longformer (Beltagy et al., 2020), and linear attention (Katharopoulos et al., 2020) reduce this to O(n log n) or O(n). As of 2026, state-of-the-art models rely on FlashAttention (Dao et al., 2022), which computes exact attention but reorders the computation to be IO-aware, avoiding materialization of the full n x n attention matrix; the resulting memory savings and wall-clock speedups help enable context windows of 128K tokens or more (e.g., GPT-4 Turbo, Gemini 1.5). Grouped-query attention (GQA) and multi-query attention (MQA) reduce KV-cache memory in decoder layers and are used in Llama 2/3 and Mistral. Self-attention is the core of encoder-only models (BERT), decoder-only models (the GPT series), and encoder-decoder models (T5). It is used wherever modeling long-range dependencies in sequences is critical: language modeling, machine translation, text summarization, image classification (ViT), and protein structure prediction (AlphaFold2).
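
To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch of how much decoder KV cache grouped-query attention saves by sharing keys and values across query-head groups. The layer and head counts are illustrative, loosely in the range of published Llama-3-scale configurations, not exact figures for any specific model:

    # KV-cache size during autoregressive decoding: per layer we store one K and
    # one V tensor of shape (seq_len, n_kv_heads, head_dim), at 2 bytes/element (fp16).
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    seq_len = 128_000  # long-context decoding
    # Full multi-head attention: every query head keeps its own K/V (128 KV heads)
    mha = kv_cache_bytes(n_layers=126, n_kv_heads=128, head_dim=128, seq_len=seq_len)
    # Grouped-query attention: only 8 KV heads, shared across the query heads
    gqa = kv_cache_bytes(n_layers=126, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"MHA: {mha / 1e9:.0f} GB   GQA: {gqa / 1e9:.0f} GB   ({mha / gqa:.0f}x smaller)")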

Common pitfalls: quadratic memory and compute for long sequences; getting causal masking right in autoregressive generation; sensitivity to the choice of positional encoding (absolute vs. relative schemes such as RoPE and ALiBi); and overfitting on small datasets. Current best practices include FlashAttention-2 for training, RoPE for positional encoding, and GQA for inference efficiency. Alternatives to self-attention include state-space models such as Mamba (Gu & Dao, 2023), which scale linearly with sequence length but may lag on recall-intensive tasks. Self-attention remains dominant as of 2026 due to its flexibility and strong performance across modalities.
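
As an illustration of the causal-masking pitfall, the sketch below (toy shapes, reusing the NumPy softmax helper from the first example) sets scores for future positions to -inf before the softmax, so each token attends only to itself and earlier tokens, as a decoder-only model requires:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def causal_self_attention(Q, K, V):
        n, d_k = Q.shape
        scores = Q @ K.T / np.sqrt(d_k)
        # Upper-triangular entries correspond to future tokens; setting them to
        # -inf makes their softmax weight exactly zero.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
        return softmax(scores, axis=-1) @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))
    out = causal_self_attention(Q, K, V)
    print(out.shape)  # (5, 64); row i depends only on positions 0..i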

Examples

  • Transformer (Vaswani et al., 2017) — original formulation with 8-head attention, d_k=64, d_model=512, used for WMT 2014 English-to-German translation (BLEU 28.4).
  • BERT (Devlin et al., 2019) — encoder-only with 12/24 layers of self-attention, trained on masked language modeling; base model has 110M parameters.
  • GPT-4 (OpenAI, 2023) — decoder-only with self-attention, estimated at roughly 1.8T parameters, context window up to 32K tokens; reportedly employs sparse attention in some layers.
  • Llama 3.1 405B (Meta, 2024) — grouped-query attention with 8 KV heads shared across 128 query heads, context length 128K, trained on 15T tokens.
  • Vision Transformer (ViT, Dosovitskiy et al., 2021) — applies self-attention to fixed-size image patches (e.g., 16x16), with the largest variant (ViT-H/14) reaching 88.55% top-1 accuracy on ImageNet without convolutions.

Related terms

Transformer · Multi-Head Attention · Positional Encoding · FlashAttention · Masked Self-Attention

FAQ

What is Self-Attention?

Self-Attention is a neural network mechanism that computes a weighted sum over all positions in a sequence, allowing each element to directly attend to every other element for capturing long-range dependencies.
