Technique · multimodal
CLIP (Contrastive Language-Image Pretraining)
Dual-encoder model trained on 400M image-caption pairs to align image and text embeddings, enabling zero-shot visual classification.
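CLIP's zero-shot classification works by embedding the image and a set of candidate captions into a shared space, then picking the caption with the highest cosine similarity. A minimal sketch of that scoring step, using toy NumPy vectors in place of real encoder outputs (the `zero_shot_classify` helper, embedding dimension, and temperature value are illustrative assumptions, not CLIP's actual API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    # L2-normalize so the dot product equals cosine similarity,
    # mirroring CLIP's normalized embedding space.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Softmax over captions gives a probability per candidate label.
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy embeddings standing in for encoder outputs (illustrative only).
rng = np.random.default_rng(0)
image = rng.normal(size=8)
captions = np.stack([
    image + 0.1 * rng.normal(size=8),  # caption aligned with the image
    rng.normal(size=8),                # unrelated caption
    rng.normal(size=8),                # unrelated caption
])
probs = zero_shot_classify(image, captions)
print(probs.argmax())  # index of the best-matching caption
```

In the real model the vectors come from a trained image encoder and text encoder; contrastive pretraining is what makes matching pairs land close together so this similarity ranking works without task-specific fine-tuning.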
Products deploying: 1
Avg research → prod: 4y
First commercial deploy: 4y
Deployment timeline
- Llama 4 Scout (medium confidence) · Deployed 2025-04-05 · Velocity 4y
“Multimodal (text+image) capability suggests use of vision-language alignment similar to CLIP.”