gentic.news — AI News Intelligence Platform

Technique · multimodal

Vision Transformer (ViT)

Applying a standard Transformer directly to sequences of image patches, establishing Transformers as the dominant image-recognition backbone.
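The patch-sequence idea in the description above can be sketched as follows. This is a minimal NumPy illustration of how an image is split into fixed-size patches, flattened, and linearly projected into Transformer input tokens; the projection matrix and position embeddings are random stand-ins for parameters that are learned in a real ViT, and the function name is ours, not from any library.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, embed_dim=768, rng=None):
    """Split an (H, W, C) image into non-overlapping patches, flatten each,
    and linearly project to embed_dim — the token sequence a ViT feeds to
    a standard Transformer encoder."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    n_h, n_w = H // patch_size, W // patch_size
    # Rearrange into (num_patches, patch_size * patch_size * C).
    patches = image.reshape(n_h, patch_size, n_w, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_h * n_w, -1)
    # Learned linear projection in practice; random stand-in here.
    W_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
    tokens = patches @ W_proj
    # Prepend a [CLS] token and add (placeholder) position embeddings.
    cls = np.zeros((1, embed_dim))
    tokens = np.concatenate([cls, tokens], axis=0)
    tokens = tokens + rng.standard_normal(tokens.shape) * 0.02
    return tokens  # shape: (1 + num_patches, embed_dim)
```

For a standard 224×224 RGB image with 16×16 patches this yields 196 patch tokens plus one [CLS] token, i.e. a sequence of 197 vectors — from there, the architecture is an unmodified Transformer encoder.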

Origin: Google, 2020-10 · Read origin paper →
Also known as: ViT
Products deploying: 4
Avg research → prod: 5y
First commercial deploy: 5y

Deployment timeline

  1. Gemini 3 Pro

    Deployed 2026-02-19 · Velocity 5y

    Gemini uses a Vision Transformer (ViT) to encode image patches.

Confidence: high
  2. Gemini 3 Flash

    Deployed 2026-02-27 · Velocity 5y

    Gemini models use a Vision Transformer (ViT) architecture for processing visual inputs, as detailed in the technical report.

Confidence: high
  3. Kimi K2.5

    Deployed 2026-03-04 · Velocity 5y

    As a vision-language model, Kimi K2.5 likely uses Vision Transformer (ViT) for image patch encoding.

Confidence: medium
  4. Qwen 3.6

    Deployed 2026-03-31 · Velocity 5y

    Qwen 3.6's multimodal version uses a Vision Transformer (ViT) as its vision encoder.

Confidence: high

Techniques built on this