Technique · multimodal
Vision Transformer (ViT)
Applying a standard Transformer directly to sequences of image patches, establishing Transformers as the dominant image-recognition backbone.
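As a rough illustration of what "sequences of image patches" means, the sketch below cuts an image into non-overlapping square patches and flattens each one into a vector, which is the token sequence a Transformer would consume. The function name `image_to_patches`, the nested-list image layout, and the toy 4 x 4 input are assumptions made for this example; they are not taken from any of the products listed here. A real ViT would typically also apply a learned linear projection and add position embeddings, which are omitted.

```python
from typing import List

# Assumed layout for this sketch: height x width x channels, as nested lists.
Image = List[List[List[float]]]


def image_to_patches(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping patches.

    Returns (H // P) * (W // P) vectors, each of length P * P * C,
    ordered row by row across the patch grid.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one P x P x C patch into a single vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 single-channel image whose pixel values are 0..15.
toy = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tokens = image_to_patches(toy, patch_size=2)
print(len(tokens), len(tokens[0]))  # 4 patches, each of length 4
```

With a patch size of 2, the 4 x 4 toy image becomes a sequence of four tokens; larger images and patch sizes work the same way, with the sequence length growing with the number of patches.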
Products deploying: 4
Avg research → prod: 5y
First commercial deploy: 5y
Deployment timeline
- Gemini 3 Pro · high
Deployed 2026-02-19 · Velocity 5y
“Gemini uses a Vision Transformer (ViT) to encode image patches.”
- Gemini 3 Flash · high
Deployed 2026-02-27 · Velocity 5y
“Gemini models use a Vision Transformer (ViT) architecture for processing visual inputs, as detailed in the technical report.”
- Kimi K2.5 · medium
Deployed 2026-03-04 · Velocity 5y
“As a vision-language model, Kimi K2.5 likely uses Vision Transformer (ViT) for image patch encoding.”
- Qwen 3.6 · high
Deployed 2026-03-31 · Velocity 5y
“Qwen 3.6's multimodal version uses a Vision Transformer (ViT) as its vision encoder.”