The Vision Transformer (ViT) is a deep learning model that applies the Transformer architecture, originally designed for natural language processing, directly to image data. Introduced by Dosovitskiy et al. in 2021 ("An Image is Worth 16x16 Words"), ViT divides an input image into fixed-size non-overlapping patches (e.g., 16×16 pixels), linearly embeds each patch into a vector, and adds a learnable positional encoding to retain spatial information. These patch embeddings are then fed as a sequence into a standard Transformer encoder, which consists of alternating layers of multi-head self-attention (MHSA) and multilayer perceptron (MLP) blocks with residual connections and layer normalization. A special [CLS] token prepended to the sequence is used for classification via an MLP head.
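The patchify-embed-prepend pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the weight shapes, initialization scale, and variable names (`W_embed`, `cls_token`, `pos_embed`) are illustrative assumptions, and in a real model all of them would be learned parameters.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # Reshape into a (grid_h, P, grid_w, P, C) block layout, then flatten each patch.
    patches = image.reshape(H // P, P, W // P, P, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

def embed_patches(patches, W_embed, pos_embed, cls_token):
    """Linearly project patches, prepend [CLS], add learned positions."""
    tokens = np.concatenate([cls_token, patches @ W_embed], axis=0)  # (N+1, D)
    return tokens + pos_embed                                        # (N+1, D)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))       # standard 224x224 RGB input
P, D = 16, 768                                   # ViT-Base patch size / width
patches = patchify(image, P)                     # 14x14 grid -> (196, 768)
W_embed = rng.standard_normal((P * P * 3, D)) * 0.02
cls_token = rng.standard_normal((1, D)) * 0.02
pos_embed = rng.standard_normal((patches.shape[0] + 1, D)) * 0.02
tokens = embed_patches(patches, W_embed, pos_embed, cls_token)
print(tokens.shape)  # (197, 768): 196 patch tokens + 1 [CLS] token
```

Note that for a 224×224 image with 16×16 patches, the flattened patch dimension (16·16·3 = 768) happens to equal the ViT-Base embedding width, but the two are independent hyperparameters in general.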
ViT's key innovation is replacing convolutional layers with self-attention over patches, allowing the model to capture long-range dependencies across the entire image from the earliest layers. Early ViTs required large-scale pretraining (e.g., on JFT-300M or ImageNet-21k) to outperform convolutional baselines such as ResNet, but later variants (e.g., DeiT, the Data-efficient Image Transformer) reduced the data requirement through knowledge distillation and improved training recipes.
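The "self-attention over patches" mechanism can be shown with a single-head, scaled dot-product attention sketch in NumPy. This omits multi-head projection, residual connections, and layer normalization; the weight matrices `Wq`, `Wk`, `Wv` and the dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over patch tokens.

    Every token attends to every other token, which is why even the first
    layer can relate patches on opposite sides of the image.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n): one score per token pair
    return softmax(scores, axis=-1) @ V  # each output is a weighted mix of all tokens

rng = np.random.default_rng(0)
n, d = 197, 64                           # 196 patch tokens + [CLS]; toy head width
tokens = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (197, 64)
```

The (n, n) score matrix is also where the quadratic cost discussed below comes from: every token pair gets a score.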
Technically, self-attention in ViT has cost O(n²) in the number of patches (the sequence length), making it computationally expensive for high-resolution images. To address this, hierarchical vision transformers (e.g., Swin Transformer, PVT) use shifted windows or pyramid structures to restrict self-attention to local regions while maintaining global connectivity across layers. Cross-attention variants like CrossViT fuse token sequences from multiple patch scales (a fine-grained and a coarse-grained branch) via cross-attention. Modern ViTs (2026) incorporate convolution-like inductive biases via hybrid architectures (e.g., CoAtNet, CvT) or adopt positional encoding strategies originally developed for language models (e.g., RoPE, ALiBi).
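The savings from window-restricted attention can be quantified with simple counting. The sketch below compares pairwise-score counts for full self-attention versus Swin-style non-overlapping windows in a single layer; the 56×56 token grid and 7×7 window are illustrative values, and the count ignores the cross-window connectivity that shifted windows add.

```python
def attention_pairs(num_tokens):
    """Pairwise score count for full self-attention: O(n^2)."""
    return num_tokens ** 2

def windowed_attention_pairs(grid_side, window_side):
    """Pairwise scores when attention is restricted to non-overlapping
    window_side x window_side windows (Swin-style, one layer, no shift)."""
    windows = (grid_side // window_side) ** 2
    tokens_per_window = window_side ** 2
    return windows * tokens_per_window ** 2

# A 224x224 image with 4x4 patches -> 56x56 = 3136 tokens.
grid = 56
full = attention_pairs(grid * grid)           # 3136^2 = 9,834,496 pairs
windowed = windowed_attention_pairs(grid, 7)  # 64 windows x 49^2 = 153,664 pairs
print(full, windowed, full // windowed)       # ratio = 64 = number of windows
```

In general the per-layer saving equals the number of windows, because windowing turns one n×n score matrix into many small ones whose sizes sum to n.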
ViT is now the dominant backbone for image classification, object detection, and segmentation, often surpassing traditional CNNs on large-scale datasets. It is preferred over CNNs when data is abundant and computational resources allow full self-attention; on smaller datasets, convolutional or hybrid models remain competitive. Common pitfalls include sensitivity to patch size (halving the patch size quadruples the sequence length, and attention cost grows quadratically in that length), the need for careful regularization (dropout, stochastic depth), and difficulty with high-resolution inputs without hierarchical downsampling.
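The patch-size pitfall is worth making concrete. A short sketch of the arithmetic for a 224×224 input (patch sizes chosen for illustration):

```python
def num_tokens(image_side, patch_size):
    """Sequence length (excluding [CLS]) for a square image."""
    return (image_side // patch_size) ** 2

for p in (32, 16, 8):
    n = num_tokens(224, p)
    print(p, n, n * n)  # patch size, tokens, pairwise attention scores
# 32 ->  49 tokens,   2,401 pairs
# 16 -> 196 tokens,  38,416 pairs
#  8 -> 784 tokens, 614,656 pairs
```

Each halving of the patch size multiplies the token count by 4 and the attention-score count by 16, which is why fine-grained patches quickly become impractical without hierarchical downsampling.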
As of 2026, state-of-the-art vision transformers include ViT-22B (Google), SwinV2 (Microsoft), and EfficientViT (MIT). ViTs are widely deployed in autonomous driving (perception pipelines), medical imaging (tumor segmentation), and multimodal models (e.g., CLIP, Flamingo). The architecture has also been extended to video (TimeSformer, VideoMAE) and 3D point clouds (Point Transformer).