Technique · multimodal
LLaVA (Visual Instruction Tuning)
Projects CLIP vision features into an LLM's token-embedding space via a simple projector, then instruction-tunes on GPT-4-generated visual conversations.
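The core mechanism can be sketched in a few lines: a frozen CLIP vision encoder emits patch features, a small projector maps them to the LLM's hidden size, and the projected "visual tokens" are concatenated with the text token embeddings before the LLM forward pass. The sketch below uses a two-layer MLP projector (as in LLaVA-1.5; the original LLaVA used a single linear layer). Dimensions and the `VisionProjector` name are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative LLaVA-style projector: maps frozen CLIP patch
    features into the LLM's token-embedding space. Shown here as a
    two-layer MLP (LLaVA-1.5 style); LLaVA v1 used a single Linear."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from the CLIP ViT
        return self.proj(patch_feats)

# Toy dimensions (CLIP ViT-L/14 at 336px yields 576 patch tokens of dim 1024)
batch, num_patches, vision_dim, llm_dim = 2, 576, 1024, 4096
projector = VisionProjector(vision_dim, llm_dim)
visual_tokens = projector(torch.randn(batch, num_patches, vision_dim))

# Projected visual tokens are prepended to the text token embeddings,
# so the LLM treats the image as a prefix of ordinary input tokens.
text_embeds = torch.randn(batch, 32, llm_dim)  # stand-in for text embeddings
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([2, 608, 4096])
```

During visual instruction tuning, only the projector (and optionally the LLM) is trained; the vision encoder stays frozen, which is what keeps the recipe cheap relative to end-to-end multimodal pretraining.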
3 · Products deploying
3y · Avg research → prod
2.0y · First commercial deploy
Deployment timeline
- Llama 4 Scout (confidence: medium)
Deployed 2025-04-05 · Velocity 2.0y
“Natively multimodal (text+image) open-weight model, similar to LLaVA's approach of projecting vision features into the LLM.”
- Kimi K2.5 (confidence: medium)
Deployed 2026-03-04 · Velocity 3y
“Kimi K2.5 is a multimodal model with vision capabilities, similar to LLaVA's approach of projecting visual features into LLM token space.”
- Qwen 3.6 (confidence: high)
Deployed 2026-03-31 · Velocity 3y
“Qwen 3.6 includes a multimodal version (Qwen-VL) that uses a vision encoder and projector.”