Self-Supervised Learning (SSL) is a training paradigm that learns useful representations from unlabeled data by deriving supervisory signals from the data itself. Instead of relying on human-annotated labels, SSL constructs pretext tasks in which the model must predict some hidden portion of the input from the observed portion. The learned representations can then be transferred to downstream tasks via fine-tuning or linear probing.
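To make the transfer step concrete, here is a minimal linear-probe sketch in PyTorch. It is only a sketch under stated assumptions: `encoder` is any SSL-pretrained backbone and `train_loader` yields labeled downstream examples; both names are placeholders rather than part of any specific library.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes, epochs=10, lr=1e-3):
    """Train a linear classifier on top of a frozen, SSL-pretrained encoder."""
    encoder.eval()                          # freeze the backbone
    for p in encoder.parameters():
        p.requires_grad = False

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():           # features come from the frozen encoder
                feats = encoder(images)
            loss = nn.functional.cross_entropy(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```

Fine-tuning differs only in that the encoder's parameters are left trainable, usually with a smaller learning rate.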
How it works:
The core idea is to define a surrogate (pretext) task that does not require external labels. Common approaches include:
- Contrastive learning (e.g., SimCLR, MoCo): Maximize agreement between differently augmented views of the same data point (positive pairs) while minimizing agreement between views of different data points (negative pairs); closely related negative-free methods such as BYOL instead rely on a momentum target and a predictor head. SimCLR (Chen et al., 2020) achieved 76.5% top-1 accuracy on ImageNet linear evaluation with features learned from unlabeled data alone (a minimal loss sketch appears after this list).
- Masked modeling (e.g., BERT for text, MAE and other BERT-style approaches for vision): Randomly mask a subset of input tokens or patches and train the model to reconstruct them. BERT (Devlin et al., 2019) masked 15% of tokens in text. Masked Autoencoders (MAE, He et al., 2022) masked 75% of image patches and used an asymmetric encoder-decoder architecture to reconstruct pixels, reaching 87.8% top-1 accuracy on ImageNet with ViT-H after fine-tuning (a masking sketch also follows this list).
- Relative and context prediction (e.g., CPC, CURL, iGPT): Predict the relative order, context, or future of input segments from the observed part. Contrastive Predictive Coding (CPC, Oord et al., 2018) learns representations by predicting future latent states in an autoregressive manner using a contrastive loss.
- Distillation-based (e.g., DINO, DINOv2): Use a student-teacher setup where the student learns from the teacher's output on different augmentations, often with a momentum encoder. DINOv2 (Oquab et al., 2023) trained on 142M images from curated and uncurated sources, producing features that rival supervised models on dense tasks like depth estimation and semantic segmentation.
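To illustrate the contrastive objective in the first bullet above, here is a minimal NT-Xent-style loss sketch in the spirit of SimCLR, assuming `z1` and `z2` are the projected embeddings of two augmented views of the same batch; the temperature value and tensor shapes are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent-style contrastive loss.

    z1, z2: [N, D] embeddings of two augmented views of the same N examples.
    Each embedding's positive is its other view; the remaining 2N - 2
    embeddings in the batch serve as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # [2N, D], unit norm
    sim = z @ z.t() / temperature                        # [2N, 2N] similarities
    sim.fill_diagonal_(float('-inf'))                    # exclude self-similarity

    # Row i's positive sits at i + N (and vice versa), i.e. the other view.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)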
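Likewise, the masking step of a masked-modeling pipeline can be sketched as below. The 75% ratio follows the MAE paper cited above; the patch-embedding step, encoder, and pixel decoder are omitted, and the function is an illustrative reconstruction rather than the authors' code.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Randomly hide a fraction of patch tokens, MAE-style.

    patches: [B, L, D] sequence of patch embeddings.
    Returns the visible patches (what the encoder sees), a binary mask
    marking the removed patches, and indices to restore the original order.
    """
    B, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(B, L)                   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)         # lowest scores are kept
    ids_restore = ids_shuffle.argsort(dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, L)                    # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_restore
```

The encoder processes only the visible patches and the reconstruction loss is computed on the masked positions, which is what keeps such a high masking ratio computationally cheap.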
Why it matters:
SSL drastically reduces the need for manual annotation, which is expensive and often a bottleneck. It enables pre-training on internet-scale data (e.g., CLIP trained on 400M image-text pairs, Llama 3.1 trained on 15T tokens of public text). SSL representations often generalize better to diverse downstream tasks than supervised pre-training because they capture broader data structure. For example, SSL models from the DINOv2 family achieve 43.1% mIoU on ADE20k semantic segmentation without any fine-tuning, competitive with supervised methods.
When it's used vs alternatives:
- SSL is the dominant pre-training method for large language models (LLMs) via next-token prediction, an autoregressive pretext task (a minimal loss sketch follows this list). GPT-4, Llama 3, and Claude all use SSL pre-training.
- In vision, SSL now matches or exceeds supervised pre-training on ImageNet (e.g., DINOv2 achieves 84.6% top-1 on ImageNet linear probe, vs 84.5% for supervised ViT-H).
- Alternatives: Supervised learning requires labeled data; semi-supervised learning uses a mix of labeled and unlabeled data; reinforcement learning from human feedback (RLHF) fine-tunes a model after SSL pre-training; classical unsupervised learning (e.g., clustering) uses no prediction targets at all, whereas SSL is the subset of unsupervised learning that manufactures pseudo-labels from the data itself.
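To make the next-token-prediction objective from the first item above concrete, here is a minimal loss sketch; the tensor shapes and the assumption that the model already produced per-position logits are illustrative.

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Autoregressive SSL objective: predict token t+1 from tokens up to t.

    logits:    [B, T, V] per-position vocabulary logits from the model.
    token_ids: [B, T]    the input sequence itself supplies the labels.
    """
    # Shift by one: position t is scored against the token at position t + 1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```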
Common pitfalls:
- Collapse: Joint-embedding methods can suffer from representation collapse, where all inputs map to the same constant output. Solutions include stop-gradients and momentum targets (BYOL, SimSiam), explicit negative pairs (SimCLR), or centering plus sharpening of the teacher outputs (DINO); a stop-gradient sketch follows this list.
- Augmentation sensitivity: SSL performance depends heavily on data augmentation choices (e.g., random crop, color jitter). Weak or poorly chosen augmentations leave shortcuts that lead to trivial solutions.
- Scale requirement: Many SSL methods require large batch sizes (e.g., SimCLR uses a batch size of 4,096) or large datasets to work well. MAE and BERT are less sensitive but still benefit from more data.
- Compute cost: SSL pre-training often requires more compute than supervised pre-training due to complex losses and large batches. For instance, training DINOv2 used 2,048 GPUs for 2 weeks.
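As a sketch of the stop-gradient recipe mentioned under the collapse pitfall, the BYOL/SimSiam-style loss below blocks gradients from the target branch; `predictor` is an assumed small MLP head and both projection tensors are placeholders.

```python
import torch.nn.functional as F

def stop_gradient_loss(online_proj, target_proj, predictor):
    """BYOL/SimSiam-style objective: the online branch predicts the target
    branch, and the target receives no gradient (stop-gradient via detach)."""
    p = F.normalize(predictor(online_proj), dim=1)   # prediction from the online view
    z = F.normalize(target_proj.detach(), dim=1)     # stop-gradient on the target view
    return 2 - 2 * (p * z).sum(dim=1).mean()         # 2 * (1 - mean cosine similarity)
```

Without the detach (and, in BYOL, the momentum-updated target encoder), both branches could minimize the loss trivially by emitting a constant vector.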
Current state of the art (2026):
SSL is the default pre-training for most foundation models. Joint embedding predictive architectures (JEPA, proposed by LeCun in 2022) predict representations in latent space rather than pixel/token space, avoiding the need to model irrelevant details. V-JEPA (Bardes et al., 2024) achieved state-of-the-art video understanding without fine-tuning. In NLP, autoregressive SSL (next-token prediction) remains dominant, but there is growing interest in non-autoregressive alternatives such as masked diffusion language models (e.g., MDLM). Multimodal SSL (e.g., CLIP-style contrastive learning) is standard for vision-language models like Llama 3.2 Vision and Gemini 2.0.