Technique · multimodal
LLaVA (Visual Instruction Tuning)
Projects CLIP vision features into an LLM's token-embedding space via a simple projector, then instruction-tunes on GPT-4-generated visual conversations.
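The core mechanism can be sketched in a few lines: a frozen CLIP vision encoder emits patch features, a small projector maps them to the LLM's hidden size, and the projected "visual tokens" are concatenated with the text token embeddings before the LLM forward pass. The sketch below uses a two-layer MLP projector (as in LLaVA-1.5; the original LLaVA used a single linear layer). Dimensions and the `VisionProjector` name are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative LLaVA-style projector: maps frozen CLIP patch
    features into the LLM's token-embedding space. Shown here as a
    two-layer MLP (LLaVA-1.5 style); LLaVA v1 used a single Linear."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim) from the CLIP ViT
        return self.proj(patch_feats)

# Toy dimensions (CLIP ViT-L/14 at 336px yields 576 patch tokens of dim 1024)
batch, num_patches, vision_dim, llm_dim = 2, 576, 1024, 4096
projector = VisionProjector(vision_dim, llm_dim)
visual_tokens = projector(torch.randn(batch, num_patches, vision_dim))

# Projected visual tokens are prepended to the text token embeddings,
# so the LLM treats the image as a prefix of ordinary input tokens.
text_embeds = torch.randn(batch, 32, llm_dim)  # stand-in for text embeddings
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([2, 608, 4096])
```

During visual instruction tuning, only the projector (and optionally the LLM) is trained; the vision encoder stays frozen, which is what keeps the recipe cheap relative to end-to-end multimodal pretraining.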
3 · Products deploying
3y · Avg research → prod
2.0y · First commercial deploy
Deployment timeline
- Llama 4 Scout (confidence: medium)
Deployed 2025-04-05 · Velocity 2.0y
“Natively multimodal (text+image) open-weight model, similar to LLaVA's approach of projecting vision features into the LLM.”
- Kimi K2.5 (confidence: medium)
Deployed 2026-03-04 · Velocity 3y
“Kimi K2.5 is a multimodal model with vision capabilities, similar to LLaVA's approach of projecting visual features into LLM token space.”
- Qwen 3.6 (confidence: high)
Deployed 2026-03-31 · Velocity 3y
“Qwen 3.6 includes a multimodal version (Qwen-VL) that uses a vision encoder and projector.”