CanViT: First Active-Vision Foundation Model Hits 45.9% mIoU on ADE20K with Sequential Glimpses
A new research paper introduces CanViT, the first architecture designed as a scalable Active-Vision Foundation Model (AVFM). Active vision—where a model perceives a scene through a sequence of localized, high-resolution "glimpses" rather than processing an entire static image—has long promised more efficient and biologically plausible computer vision. However, it has been hampered by a lack of general-purpose, scalable architectures and pretraining methods. CanViT, detailed in a new arXiv preprint, aims to close this gap, demonstrating performance that begins to rival traditional "passive" vision models on standard benchmarks while using significantly less computational power per step.
What the Researchers Built: A Canvas for Active Vision
The core innovation of CanViT is its decoupled architecture, which separates "thinking" from "memory." It consists of two main components:
- A retinotopic Vision Transformer (ViT) backbone: This is the "thinking" module. It processes each individual glimpse (a small, localized patch of the scene) independently. To handle the variable locations and scales of these glimpses, the researchers employ scene-relative Rotary Position Embedding (RoPE). This binds the positional information of tokens within a glimpse to the glimpse's absolute coordinates in the overall scene, rather than its position within a fixed grid.
- A spatiotopic "canvas": This is the high-capacity, scene-wide latent workspace or "memory." It's a fixed-size feature map that represents the model's accumulating understanding of the entire scene. Crucially, the canvas does not use self-attention or fully-connected layers, which are computationally expensive and scale poorly with scene size.
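The scene-relative RoPE idea from the first bullet can be sketched with a minimal, numpy-only rotary embedding that binds each token's rotation angle to its absolute (x, y) position in the scene rather than a grid index. This is an illustrative simplification; the paper's exact parameterization (frequencies, head layout) may differ.

```python
import numpy as np

def scene_relative_rope(tokens, coords, base=10000.0):
    """Rotate feature pairs by angles derived from absolute scene coordinates.

    tokens: (n, d) glimpse token features, d divisible by 4
    coords: (n, 2) absolute (x, y) position of each token in the full scene
    """
    n, d = tokens.shape
    half, quarter = d // 2, d // 4          # first half encodes x, second half y
    freqs = base ** (-np.arange(quarter) / quarter)
    out = np.empty_like(tokens)
    for axis in range(2):                    # 0: x coordinate, 1: y coordinate
        angles = coords[:, axis:axis + 1] * freqs          # (n, quarter)
        cos, sin = np.cos(angles), np.sin(angles)
        block = tokens[:, axis * half:(axis + 1) * half]
        even, odd = block[:, 0::2], block[:, 1::2]
        # standard rotary rotation of (even, odd) feature pairs
        out[:, axis * half:(axis + 1) * half:2] = even * cos - odd * sin
        out[:, axis * half + 1:(axis + 1) * half:2] = even * sin + odd * cos
    return out
```

The useful property, and the reason it suits glimpses at arbitrary locations, is that dot products between rotated tokens depend only on their coordinate *difference*, so attention is invariant to where in the scene a glimpse happens to land.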
The binding mechanism between these two is a novel Canvas Attention module. It's an asymmetric cross-attention operation where the glimpse features (from the backbone) act as queries, and the canvas features act as keys and values. This allows the model to efficiently read from and write to its working memory with each new glimpse, updating its global understanding without recomputing everything from scratch.
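The read direction of that asymmetric cross-attention can be sketched as follows (single head, numpy only; the projection weights stand in for learned parameters, and the canvas write path, which the description above does not fully specify, is omitted).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def canvas_attention_read(glimpse, canvas, wq, wk, wv):
    """Glimpse tokens act as queries; canvas tokens act as keys and values.

    glimpse: (g, d) features from the ViT backbone for the current glimpse
    canvas:  (c, d) the flattened scene-wide canvas feature map
    """
    q, k, v = glimpse @ wq, canvas @ wk, canvas @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (g, c) read weights
    return glimpse + attn @ v                        # residual read from memory
```

Because the canvas only ever appears as keys and values here, no attention is computed *among* canvas positions, which is what keeps the per-step cost linear in canvas size rather than quadratic.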
Key Results: Closing the Gap with Passive Vision
The researchers pretrained the CanViT-Base variant from random initialization on 13.2 million scenes from ImageNet-21k, generating 1 billion random glimpses with randomized locations, zoom levels, and sequence lengths. This pretraining used a label-free, self-supervised objective called policy-agnostic passive-to-active dense latent distillation. Simply put, the model is trained to reconstruct the full-scene dense features from a passive vision teacher model (DINOv3) using only the sequence of low-resolution glimpses.
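The two ingredients of this recipe, a randomized glimpse policy and dense latent distillation against a frozen teacher, can be sketched as below. The zoom range and the plain MSE loss are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def sample_random_glimpse(rng, scene_hw, min_zoom=1.0, max_zoom=4.0):
    """Policy-agnostic pretraining glimpse: uniform random location and zoom.

    Returns (x, y, width, height) of a crop that lies inside the scene.
    """
    h, w = scene_hw
    zoom = rng.uniform(min_zoom, max_zoom)
    gh, gw = max(1, int(h / zoom)), max(1, int(w / zoom))
    y = int(rng.integers(0, h - gh + 1))
    x = int(rng.integers(0, w - gw + 1))
    return x, y, gw, gh

def dense_distill_loss(canvas_decoded, teacher_feats):
    """Match the student's full-scene dense prediction (decoded from the
    canvas) to the passive teacher's dense feature map (e.g. DINOv3 output).
    A plain mean-squared error is assumed here for illustration."""
    return float(np.mean((canvas_decoded - teacher_feats) ** 2))
```

Because the glimpse sampler, not a learned policy, drives pretraining, nothing in the learned weights is tied to any particular looking strategy, which is what the paper's "policy-agnostic" framing refers to.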
After pretraining, the model was evaluated without any task-specific fine-tuning.
- ADE20K semantic segmentation, mIoU (1 glimpse): 38.5% vs. 27.6% for the prior best active model, a +10.9-point absolute improvement achieved with 19.5x fewer inference FLOPs
- ADE20K semantic segmentation, mIoU (multiple glimpses): 45.9%, outperforming a FLOP-matched DINOv3 teacher model
- ImageNet-1K classification, top-1 accuracy: 81.2%, using frozen linear probes on the canvas features

The results show that with just a single glimpse, CanViT significantly outperforms the previous best active model. More importantly, as it receives more glimpses, its performance climbs to 45.9% mIoU, demonstrating its ability to integrate information over time and to approach the performance of passive models that see the entire scene at once, with a fundamentally more efficient perceptual strategy.
How It Works: Training and Inference
The pretraining pipeline is key to CanViT's generalization. By using a randomized glimpse policy during training—where the location, zoom, and number of glimpses are all varied—the model learns to be agnostic to any specific policy for selecting glimpses. This makes it a true foundation model: it can later be paired with a learned policy (e.g., a reinforcement learning agent) or a fixed heuristic for downstream tasks.
During sequential inference, the process is as follows:
- A new glimpse (image patch) is taken from the scene at a specific (x, y) coordinate and scale.
- The glimpse is processed by the ViT backbone with scene-relative RoPE.
- The resulting glimpse features attend to the current canvas state via Canvas Attention.
- The canvas is updated with the new information.
- A task-specific head (e.g., for segmentation or classification) can read from the canvas at any time to produce an output.
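The loop above can be sketched end to end with toy stand-ins for the learned modules (every callable here is a placeholder: the real backbone, Canvas Attention, and head are neural networks, and a real policy would decide where to look from the canvas state).

```python
import numpy as np

def active_inference(scene, policy, backbone, update_canvas, head, canvas, steps):
    """Sequential glimpse loop; step numbers mirror the list above."""
    for _ in range(steps):
        x, y, size = policy(canvas, scene.shape)     # 1. pick the next glimpse
        glimpse = scene[y:y + size, x:x + size]      #    crop at that location
        feats = backbone(glimpse, x, y)              # 2. ViT + scene-relative RoPE
        canvas = update_canvas(canvas, feats, x, y)  # 3-4. Canvas Attention update
    return head(canvas)                              # 5. read out at any time

# toy stand-ins so the loop runs end to end
rng = np.random.default_rng(0)
scene = rng.normal(size=(64, 64))
policy = lambda canvas, shape: (int(rng.integers(0, 48)),
                                int(rng.integers(0, 48)), 16)  # random looking
backbone = lambda g, x, y: g.mean()                  # pretend glimpse feature
def update_canvas(canvas, feats, x, y):
    canvas = canvas.copy()
    canvas[y // 16, x // 16] = feats                 # write into a coarse cell
    return canvas
head = lambda canvas: float(np.abs(canvas).sum())    # pretend task readout

result = active_inference(scene, policy, backbone, update_canvas, head,
                          np.zeros((4, 4)), steps=8)
```

Note that the canvas is the only state carried between steps, so each iteration's cost depends on the glimpse size, not on how much of the scene has already been seen.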
This design eliminates the need for costly self-attention over the entire canvas history, enabling low-latency updates and scalability to very large scenes.
Why It Matters: A New Axis for Vision Research
CanViT represents a concrete step toward making active vision a practical and scalable paradigm. For years, active vision has been confined to narrow tasks or small-scale simulations. This work demonstrates that with the right architecture and large-scale, self-supervised pretraining, active models can perform competitively on standard, challenging benchmarks like ADE20K.
The implications are significant for applications with tight compute, latency, or bandwidth budgets, such as robotics, augmented reality, and mobile vision. Instead of processing megapixel images frame-by-frame, a system could use an active vision model to intelligently sample and integrate visual information over time, drastically reducing compute load.
agentic.news Analysis
This paper, posted to arXiv on March 23, 2026, arrives amidst a consistent stream of foundational AI research published on the platform. In the past week alone, arXiv has hosted studies on topics ranging from mitigating overrefusal in LLMs to new frameworks for sequential recommendation, as covered in our recent articles on PFSR and MI-DPG. The introduction of CanViT as an "Active-Vision Foundation Model" directly contributes to another major trend visible in the arXiv corpus: the development of agentic AI systems. The knowledge graph shows numerous connections between arXiv publications and technologies like reinforcement learning and AI agents. CanViT provides a critical perceptual substrate for such agents, moving beyond passive scene understanding to an active, sequential, and efficient form of perception that is more aligned with how embodied agents operate in the world.
Technically, the choice of DINOv3 as the teacher model is a savvy one. As noted in our knowledge graph, DINOv3 has been mentioned in several prior articles, establishing itself as a robust, self-supervised vision model capable of producing high-quality dense features. CanViT's distillation approach effectively transfers the knowledge of this powerful passive model into an active framework. The reported training time of 166 hours on a single H100 for 1 billion glimpses also sets a new scale benchmark for active vision, an order of magnitude beyond previous efforts and the kind of scale that a foundation-model claim requires.
The success of CanViT's decoupled architecture—separating a lightweight, glimpse-processing backbone from a persistent canvas memory—may inspire similar designs in other sequential processing domains. It echoes architectural principles seen in some memory-augmented language models but applies them to the spatially-grounded problem of vision. The next logical step, hinted at by the authors, is to combine CanViT with a learned policy network to create fully autonomous active perception systems, a direction deeply connected to the reinforcement learning and AI agent research frequently published alongside it on arXiv.
Frequently Asked Questions
What is an Active-Vision Foundation Model (AVFM)?
An Active-Vision Foundation Model is a general-purpose neural network architecture designed for active computer vision. Unlike traditional models that process a full, static image in one pass, an AVFM perceives a scene through a sequence of localized, high-resolution "glimpses." It is "foundation" because it is pretrained on large-scale data in a task-agnostic way, allowing it to be adapted for various downstream applications like segmentation or classification, and "policy-agnostic" because it isn't tied to one specific strategy for choosing where to look next.
How is CanViT more efficient than a standard Vision Transformer?
CanViT achieves efficiency through sequential processing and architectural choices. Instead of applying self-attention to a full image's worth of patches (which scales quadratically), it processes only a small glimpse at each step. Its "canvas" memory does not use self-attention, avoiding costly computations over a large latent space. The paper reports that for a single glimpse, CanViT uses 19.5x fewer inference FLOPs than the prior best active model while achieving better accuracy.
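A back-of-the-envelope calculation illustrates the quadratic-attention part of this saving. The image and glimpse sizes and the 14-pixel patch size below are illustrative assumptions, not the paper's configuration; the reported 19.5x figure also reflects other architectural savings.

```python
def self_attention_flops(n_tokens, dim):
    # rough cost of QK^T plus attn @ V: ~2 * n^2 * d multiply-adds
    return 2 * n_tokens ** 2 * dim

full_tokens = (518 // 14) ** 2     # a 518x518 image with 14px patches: 1369 tokens
glimpse_tokens = (224 // 14) ** 2  # a single 224x224 glimpse: 256 tokens
ratio = (self_attention_flops(full_tokens, 768)
         / self_attention_flops(glimpse_tokens, 768))
print(round(ratio, 1))  # ~28.6x cheaper per attention layer, from n^2 scaling
```

Because the cost grows with the square of the token count, even a modest reduction in tokens per step compounds into a large per-layer saving.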
What was CanViT trained on?
The CanViT-B model was pretrained from scratch on 13.2 million images from the ImageNet-21k dataset. From these images, the training pipeline generated 1 billion random glimpses. The training objective was not based on human labels but on reconstructing the dense feature maps produced by the DINOv3 vision model for the entire scene, using only the sequence of glimpses.
Can CanViT be used for robotics or real-time applications?
The architecture is designed with such applications in mind. Its low-latency sequential update (due to the lack of canvas self-attention) and scalable design make it a promising candidate for robotics, where an agent must build a scene understanding over time while controlling where it looks. However, the current work focuses on offline evaluation on standard datasets; integrating it with a real-time control policy and sensorimotor loop remains future work.