
Zero-Shot Learning: definition + examples

Zero-shot learning (ZSL) is a machine learning paradigm where a model is trained to classify instances from classes that were not present in the training data. This is achieved by exploiting a shared semantic space that encodes relationships between seen and unseen classes. The core idea is to learn a mapping from feature representations (e.g., image embeddings) to a semantic embedding space (e.g., attribute vectors or word2vec embeddings) during training on seen classes. At inference, the model projects a test instance into the same semantic space and compares it to the semantic representations of unseen classes, assigning the label of the closest match.

How it works technically:

ZSL typically uses a compatibility function \(F(x, y; \theta)\) that measures the similarity between an input \(x\) and a class \(y\) via their embeddings. For image classification, a CNN (e.g., ResNet-101) extracts visual features \(\phi(x)\), while class \(y\) is represented by a semantic vector \(\psi(y)\) — often derived from word embeddings (e.g., GloVe, word2vec) or manually defined attribute vectors (e.g., "has fur", "can fly"). A common approach is to learn a linear or nonlinear transformation \(W\) such that \(\phi(x)^T W \psi(y)\) is high for correct pairs. During training, only seen classes are available; the model learns to align visual and semantic embeddings via a ranking loss or cross-entropy loss on seen classes. At test time, the model evaluates \(F(x, y)\) for all unseen classes and predicts \(\arg\max_y F(x, y)\).
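
To make the compatibility formulation concrete, here is a minimal PyTorch sketch of the bilinear score \(\phi(x)^T W \psi(y)\), trained with cross-entropy over seen classes and applied to unseen classes at test time. The feature dimension (2048, e.g., ResNet-101 pooled features), attribute dimension (85, AWA2-style), and hyperparameters are illustrative placeholders, not values from any particular paper.

    import torch
    import torch.nn.functional as F

    # Illustrative dimensions: 2048-d visual features, 85-d class attribute vectors.
    VIS_DIM, SEM_DIM = 2048, 85

    # Learnable bilinear compatibility: F(x, y) = phi(x)^T W psi(y)
    W = torch.randn(VIS_DIM, SEM_DIM, requires_grad=True)
    optimizer = torch.optim.Adam([W], lr=1e-3)

    def compatibility(phi_x, psi_all):
        """Score every class for a batch of visual features.
        phi_x: (B, VIS_DIM) visual features; psi_all: (C, SEM_DIM) class embeddings."""
        return phi_x @ W @ psi_all.T                 # (B, C)

    def train_step(phi_x, labels, psi_seen):
        """One optimization step using cross-entropy over the *seen* classes only."""
        loss = F.cross_entropy(compatibility(phi_x, psi_seen), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    @torch.no_grad()
    def predict_unseen(phi_x, psi_unseen):
        """At test time, score only the unseen classes and take the argmax."""
        return compatibility(phi_x, psi_unseen).argmax(dim=1)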

A critical variant is generalized zero-shot learning (GZSL), where the test set includes both seen and unseen classes, a more realistic but harder setting in which the model tends to bias toward seen classes. Techniques such as calibrated stacking (Chao et al., 2016) or generative approaches (e.g., f-VAEGAN-D2 by Xian et al., 2019) mitigate this bias by recalibrating scores or by synthesizing features for unseen classes.
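
As a small illustration of calibrated stacking, the sketch below subtracts a calibration constant from the scores of seen classes before taking the argmax over all classes; the function name and the value of gamma are placeholders, and gamma would normally be tuned on a validation split.

    import torch

    @torch.no_grad()
    def gzsl_predict(scores, seen_mask, gamma=1.0):
        """Calibrated-stacking-style prediction for GZSL (illustrative sketch).
        scores:    (B, C) compatibility scores over all classes (seen + unseen)
        seen_mask: (C,) boolean tensor, True where the class was seen in training
        gamma:     calibration constant (placeholder value; tune on validation data)
        """
        calibrated = scores - gamma * seen_mask.float()
        return calibrated.argmax(dim=1)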

Why it matters:

ZSL reduces the need for labeled data for every possible class, which is crucial in domains where collecting examples is expensive or impossible (e.g., rare animal species, medical conditions, or new product categories). It also enables models to generalize beyond their training distribution, a step toward more flexible AI.

When used vs alternatives:

ZSL is preferred when unseen classes are known in advance and semantic descriptions exist. If no semantic side information is available, one-shot or few-shot learning (using a small number of labeled examples) may be more appropriate. For tasks where classes evolve frequently (e.g., e-commerce), ZSL can be combined with incremental learning. This contrasts with standard supervised learning, which requires labeled data for every target class.

Common pitfalls:

  • Hubness problem: Some unseen classes become "hubs" that are nearest neighbors to many test instances due to high-dimensional embedding spaces. Solutions include using cosine similarity or normalizing embeddings (see the sketch after this list).
  • Semantic gap: Poorly chosen semantic representations (e.g., noisy attributes or outdated word embeddings) degrade performance. Using large language model embeddings (e.g., CLIP, BERT) often helps.
  • Domain shift: The visual features of unseen classes may differ from seen classes in distribution, causing misalignment. Generative models that synthesize unseen-class features (e.g., using conditional GANs) address this.
  • Bias toward seen classes in GZSL: Models overconfidently predict seen classes. Calibration techniques or using a separate seen/unseen detector are common fixes.
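
For the hubness point above, a common mitigation is to L2-normalize both the projected test instances and the class embeddings so that matching uses cosine similarity rather than raw dot products; the short sketch below illustrates this, with the tensor shapes stated in the comments and nothing else assumed.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def cosine_zsl_predict(projected_x, psi_unseen):
        """Nearest-neighbour ZSL prediction with L2-normalised embeddings,
        i.e. cosine similarity instead of raw dot products (a standard
        mitigation for the hubness problem).
        projected_x: (B, D) test instances already mapped into the semantic space
        psi_unseen:  (C, D) semantic vectors of the unseen classes
        """
        x = F.normalize(projected_x, dim=1)
        y = F.normalize(psi_unseen, dim=1)
        return (x @ y.T).argmax(dim=1)               # cosine similarity, then argmax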

Current state of the art (2026):

The best-performing ZSL models leverage large pretrained vision-language models (VLMs) like CLIP (Radford et al., 2021) or ALIGN. By using their joint embedding space, ZSL becomes nearly trivial: classify by comparing image embeddings to text embeddings of class names. For example, CLIP achieves 72.2% top-1 accuracy on CUB-200 (bird species) in a zero-shot setting. More recent work (e.g., CoOp, 2022) learns prompt templates to adapt VLMs to specific datasets. For GZSL, generative approaches like f-CLSWGAN (Xian et al., 2018) and its successors remain strong, with reported harmonic means of ~70% on AWA2 (Animals with Attributes 2). The current frontier involves combining ZSL with foundation models and few-shot fine-tuning, as seen in models like Flamingo (Alayrac et al., 2022) that can handle novel visual concepts with minimal examples.
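
To show how simple VLM-based zero-shot classification is in practice, the snippet below scores an image against text prompts built from class names, using a public CLIP checkpoint through the Hugging Face transformers API; the class names, prompt template, and image path are illustrative placeholders.

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Zero-shot classification with a pretrained vision-language model:
    # compare the image embedding to text embeddings of candidate class names.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    class_names = ["indigo bunting", "painted bunting", "lazuli bunting"]   # illustrative
    prompts = [f"a photo of a {c}, a type of bird." for c in class_names]

    image = Image.open("bird.jpg")                   # placeholder image path
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)    # (1, num_classes)
    print(class_names[probs.argmax().item()])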

Examples

  • CLIP (OpenAI, 2021) classifies images into 1000 ImageNet classes without any fine-tuning by matching image embeddings to class name text embeddings, achieving 76.2% top-1 accuracy.
  • f-VAEGAN-D2 (Xian et al., 2019) uses a conditional variational autoencoder and GAN to generate visual features for unseen classes, achieving 67.1% harmonic mean on CUB in GZSL.
  • Google's ALIGN (Jia et al., 2021) scales contrastive learning to 1.8B image-text pairs, enabling zero-shot transfer to many visual benchmarks like Flickr30K.
  • CoOp (Zhou et al., 2022) learns context prompts for CLIP, improving zero-shot accuracy on 11 datasets by up to 5% over hand-crafted prompts.
  • The Animals with Attributes 2 (AWA2) benchmark provides 50 animal classes annotated with 85 attributes; under the standard split, 40 classes are seen and 10 held out as unseen, and attribute-based ZSL models typically reach roughly 60-70% per-class top-1 accuracy on the unseen classes.

Related terms

Few-Shot Learning · One-Shot Learning · Generalized Zero-Shot Learning · Semantic Embedding · Vision-Language Model

FAQ

What is Zero-Shot Learning?

Zero-shot learning (ZSL) trains a model to recognize classes never seen during training by leveraging semantic side information (e.g., attributes, word embeddings) to bridge seen and unseen categories.

How does Zero-Shot Learning work?

During training on seen classes, the model learns a mapping from input features (e.g., image embeddings) into a shared semantic space built from attributes or word embeddings. At inference, a test instance is projected into that same space, compared to the semantic representations of the unseen classes, and assigned the label of the closest match.

Where is Zero-Shot Learning used in 2026?

Mostly through pretrained vision-language models: CLIP and ALIGN classify images by matching image embeddings to text embeddings of class names, with no task-specific fine-tuning. Generative approaches such as f-VAEGAN-D2 synthesize features for unseen classes in the GZSL setting, and prompt-learning methods like CoOp adapt VLMs to specific datasets. Typical applications include recognizing rare animal species, medical conditions, and new product categories where labeled examples are scarce.