Technique · multimodal
CLIP (Contrastive Language-Image Pretraining)
Dual-encoder model trained on 400M image-caption pairs to align image and text embeddings, enabling zero-shot visual classification.
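CLIP's zero-shot classification works by embedding the image and a set of candidate captions into a shared space, then picking the caption with the highest cosine similarity. A minimal sketch of that scoring step, using toy NumPy vectors in place of real encoder outputs (the `zero_shot_classify` helper, embedding dimension, and temperature value are illustrative assumptions, not CLIP's actual API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    # L2-normalize so the dot product equals cosine similarity,
    # mirroring CLIP's normalized embedding space.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Softmax over captions gives a probability per candidate label.
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy embeddings standing in for encoder outputs (illustrative only).
rng = np.random.default_rng(0)
image = rng.normal(size=8)
captions = np.stack([
    image + 0.1 * rng.normal(size=8),  # caption aligned with the image
    rng.normal(size=8),                # unrelated caption
    rng.normal(size=8),                # unrelated caption
])
probs = zero_shot_classify(image, captions)
print(probs.argmax())  # index of the best-matching caption
```

In the real model the vectors come from a trained image encoder and text encoder; contrastive pretraining is what makes matching pairs land close together so this similarity ranking works without task-specific fine-tuning.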
Products deploying: 1
Avg research → prod: 4y
First commercial deploy: 4y
Deployment timeline
- Llama 4 Scout (medium confidence) · Deployed 2025-04-05 · Velocity 4y
“Multimodal (text+image) capability suggests use of vision-language alignment similar to CLIP.”