Technique · multimodal

Vision Transformer (ViT)

Applies a standard Transformer encoder directly to a sequence of fixed-size image patches, an approach that established Transformers as the dominant image-recognition backbone.

Origin: Google, 2020-10
Also known as: ViT
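
The patch-sequence idea in the summary above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any product listed on this page. The function names (`patchify`, `embed_patches`), the 224x224 input, the 16x16 patch size, and the 768-dimensional embedding are chosen for illustration, and the weights are random rather than trained. Only the input stage is shown: the image is split into patches, each patch is linearly projected, a class token is prepended, and position embeddings are added. The Transformer encoder that consumes these tokens is omitted.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, patch_size, proj, cls_token, pos_embed):
    """Turn an image into the token sequence a Transformer encoder would consume."""
    tokens = patchify(image, patch_size) @ proj                     # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)   # (N + 1, D)
    return tokens + pos_embed                                       # add position info


# Illustrative sizes and random (untrained) parameters: assumptions, not real weights.
rng = np.random.default_rng(0)
patch_size, dim = 16, 768
image = rng.random((224, 224, 3))
num_patches = (224 // patch_size) ** 2                              # 14 * 14 = 196
proj = rng.normal(scale=0.02, size=(patch_size * patch_size * 3, dim))
cls_token = np.zeros(dim)
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, dim))

tokens = embed_patches(image, patch_size, proj, cls_token, pos_embed)
print(tokens.shape)  # (197, 768): 196 patch tokens plus one class token
```

The resulting token matrix has the same form as a sequence of word embeddings, which is why an unmodified Transformer encoder can process it.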

Deployment timeline

Gemini 3 Pro
Deployed 2026-02-19 · Velocity 5y
“Gemini uses a Vision Transformer (ViT) to encode image patches.”
Confidence: high

Gemini 3 Flash
Deployed 2026-02-27 · Velocity 5y
“Gemini models use a Vision Transformer (ViT) architecture for processing visual inputs, as detailed in the technical report.”
Confidence: high

Kimi K2.5
Deployed 2026-03-04 · Velocity 5y
“As a vision-language model, Kimi K2.5 likely uses Vision Transformer (ViT) for image patch encoding.”
Confidence: medium

Qwen 3.6
Deployed 2026-03-31 · Velocity 5y
“Qwen 3.6's multimodal version uses a Vision Transformer (ViT) as its vision encoder.”
Confidence: high