Self-attention, also known as intra-attention, is a mechanism that relates different positions of a single sequence to compute a representation of that sequence. Attention itself was introduced by Bahdanau et al. (2014) for machine translation as an encoder-decoder mechanism; self-attention became foundational with the Transformer architecture (Vaswani et al., 2017). In self-attention, each element of the input (e.g., a word token) is mapped to three vectors, Query (Q), Key (K), and Value (V), typically via learned linear projections. Attention scores are computed as the dot product of Q with all K, scaled by the inverse square root of the key dimension d_k, and passed through a softmax to produce weights; these weights are then used to form a weighted sum of the V vectors. Mathematically: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Multi-head attention runs this process h times in parallel (h = 8 in the original Transformer) with different learned projections, then concatenates the results and projects them once more. This lets the model attend to information from different representation subspaces.
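The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random stand-ins for learned weights, and batching and multi-head splitting are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K: (n, d_k); V: (n, d_v). Scores are scaled by 1/sqrt(d_k).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
x = rng.standard_normal((n, d_model))    # a toy sequence of 4 tokens
# Random projections standing in for the learned W_Q, W_K, W_V.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): one contextualized vector per input position
```

Multi-head attention would apply this same routine h times with separate projections and concatenate the h outputs before a final linear projection.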
Self-attention eliminates the sequential processing bottleneck of RNNs, allowing parallel computation over all positions. It also addresses the long-range dependency problem: any two positions are connected by a constant number of operations (one attention step), unlike RNNs, where the path length between positions grows linearly with their distance. However, the quadratic complexity O(n^2) with respect to sequence length n is a key limitation. Variants like sparse attention (Child et al., 2019), Longformer (Beltagy et al., 2020), and linear attention (Katharopoulos et al., 2020) reduce this to O(n log n) or O(n). As of 2026, state-of-the-art models rely on FlashAttention (Dao et al., 2022), which computes exact attention with IO-aware tiling: compute remains O(n^2), but memory usage drops to O(n) and wall-clock speed improves substantially, helping enable context windows of 128K tokens or more (e.g., GPT-4, Gemini 1.5). Grouped-query attention (GQA) and multi-query attention (MQA) reduce KV-cache memory in decoder layers and are used in Llama 2/3 and Mistral. Self-attention is the core of encoder-only models (BERT), decoder-only models (the GPT series), and encoder-decoder models (T5). It is used wherever modeling long-range dependencies in sequences is critical: language modeling, machine translation, text summarization, image classification (ViT), and protein structure prediction (AlphaFold2).
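The KV-cache saving behind GQA can be sketched as follows: several query heads share one key/value head, so only the smaller set of K/V heads needs to be cached during decoding. This is a toy single-sequence sketch with random tensors, not any library's API; MQA is the special case of a single K/V head.

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_kv_heads):
    # Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d).
    # Each group of query heads shares one K/V head, shrinking the KV
    # cache by a factor of n_q_heads / n_kv_heads.
    n_q_heads, n, d = Q.shape
    heads_per_group = n_q_heads // n_kv_heads
    outputs = []
    for h in range(n_q_heads):
        kv = h // heads_per_group                   # shared K/V head index
        scores = Q[h] @ K[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)       # softmax over keys
        outputs.append(w @ V[kv])
    return np.stack(outputs)                        # (n_q_heads, n, d)

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 5, 16))   # 8 query heads
K = rng.standard_normal((2, 5, 16))   # only 2 K/V heads are cached
V = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(Q, K, V, n_kv_heads=2)
print(out.shape)  # (8, 5, 16)
```

Here the cache holds 2 K/V heads instead of 8, a 4x memory reduction at inference time, while all 8 query heads still produce distinct outputs.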
Common pitfalls: quadratic memory and compute for long sequences; incorrect causal masking in autoregressive generation; sensitivity to the choice of positional encoding (absolute, relative, RoPE, ALiBi); and overfitting on small datasets. Current best practices include FlashAttention-2 for training, RoPE for positional encoding, and GQA for inference efficiency. Alternatives to self-attention include state-space models such as Mamba (Gu & Dao, 2023), which offer linear scaling but may lag on recall-intensive tasks. Self-attention remains dominant as of 2026 due to its flexibility and strong performance across modalities.
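The causal-masking pitfall above is worth making concrete. The sketch below (toy NumPy code, assuming self-attention with Q = K = V = x) shows the standard fix: set future positions to -inf before the softmax. Zeroing the weights after the softmax instead, a common bug, leaves each row mis-normalized.

```python
import numpy as np

def causal_attention(Q, K, V):
    # Mask future positions with -inf BEFORE the softmax so that each row
    # renormalizes over past positions only.
    n, d_k = Q.shape[0], Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 8))
out, w = causal_attention(x, x, x)
# Position i attends only to positions <= i, and each row still sums to 1.
print(np.allclose(np.triu(w, k=1), 0.0))  # True
print(np.allclose(w.sum(axis=-1), 1.0))   # True
```

During incremental decoding, the same property is what allows cached K/V entries for past tokens to be reused: new tokens never change the attention of earlier positions.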