CanViT: First Active-Vision Foundation Model Hits 45.9% mIoU on ADE20K with Sequential Glimpses
A new research paper introduces CanViT, the first architecture designed as a scalable Active-Vision Foundation Model (AVFM). Active vision—where a model perceives a scene through a sequence of localized, high-resolution "glimpses" rather than processing an entire static image—has long promised more efficient and biologically plausible computer vision. However, it has been hampered by a lack of general-purpose, scalable architectures and pretraining methods. CanViT, detailed in a new arXiv preprint, aims to close this gap, demonstrating performance that begins to rival traditional "passive" vision models on standard benchmarks while using significantly less computational power per step.
What the Researchers Built: A Canvas for Active Vision
The core innovation of CanViT is its decoupled architecture, which separates "thinking" from "memory." It consists of two main components:
- A retinotopic Vision Transformer (ViT) backbone: This is the "thinking" module. It processes each individual glimpse (a small, localized patch of the scene) independently. To handle the variable locations and scales of these glimpses, the researchers employ scene-relative Rotary Position Embedding (RoPE). This binds the positional information of tokens within a glimpse to the glimpse's absolute coordinates in the overall scene, rather than its position within a fixed grid.
- A spatiotopic "canvas": This is the high-capacity, scene-wide latent workspace or "memory." It's a fixed-size feature map that represents the model's accumulating understanding of the entire scene. Crucially, the canvas does not use self-attention or fully-connected layers, which are computationally expensive and scale poorly with scene size.
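The scene-relative RoPE idea from the first bullet can be sketched with a minimal, numpy-only rotary embedding that binds each token's rotation angle to its absolute (x, y) position in the scene rather than a grid index. This is an illustrative simplification; the paper's exact parameterization (frequencies, head layout) may differ.

```python
import numpy as np

def scene_relative_rope(tokens, coords, base=10000.0):
    """Rotate feature pairs by angles derived from absolute scene coordinates.

    tokens: (n, d) glimpse token features, d divisible by 4
    coords: (n, 2) absolute (x, y) position of each token in the full scene
    """
    n, d = tokens.shape
    half, quarter = d // 2, d // 4          # first half encodes x, second half y
    freqs = base ** (-np.arange(quarter) / quarter)
    out = np.empty_like(tokens)
    for axis in range(2):                    # 0: x coordinate, 1: y coordinate
        angles = coords[:, axis:axis + 1] * freqs          # (n, quarter)
        cos, sin = np.cos(angles), np.sin(angles)
        block = tokens[:, axis * half:(axis + 1) * half]
        even, odd = block[:, 0::2], block[:, 1::2]
        # standard rotary rotation of (even, odd) feature pairs
        out[:, axis * half:(axis + 1) * half:2] = even * cos - odd * sin
        out[:, axis * half + 1:(axis + 1) * half:2] = even * sin + odd * cos
    return out
```

The useful property, and the reason it suits glimpses at arbitrary locations, is that dot products between rotated tokens depend only on their coordinate *difference*, so attention is invariant to where in the scene a glimpse happens to land.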
The binding mechanism between these two is a novel Canvas Attention module. It's an asymmetric cross-attention operation where the glimpse features (from the backbone) act as queries, and the canvas features act as keys and values. This allows the model to efficiently read from and write to its working memory with each new glimpse, updating its global understanding without recomputing everything from scratch.
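The read direction of that asymmetric cross-attention can be sketched as follows (single head, numpy only; the projection weights stand in for learned parameters, and the canvas write path, which the description above does not fully specify, is omitted).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def canvas_attention_read(glimpse, canvas, wq, wk, wv):
    """Glimpse tokens act as queries; canvas tokens act as keys and values.

    glimpse: (g, d) features from the ViT backbone for the current glimpse
    canvas:  (c, d) the flattened scene-wide canvas feature map
    """
    q, k, v = glimpse @ wq, canvas @ wk, canvas @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (g, c) read weights
    return glimpse + attn @ v                        # residual read from memory
```

Because the canvas only ever appears as keys and values here, no attention is computed *among* canvas positions, which is what keeps the per-step cost linear in canvas size rather than quadratic.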
Key Results: Closing the Gap with Passive Vision
The researchers pretrained the CanViT-Base variant from random initialization on 13.2 million scenes from ImageNet-21k, generating 1 billion random glimpses with randomized locations, zoom levels, and sequence lengths. This pretraining used a label-free, self-supervised objective called policy-agnostic passive-to-active dense latent distillation. Simply put, the model is trained to reconstruct the full-scene dense features from a passive vision teacher model (DINOv3) using only the sequence of low-resolution glimpses.
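The two ingredients of this recipe, a randomized glimpse policy and dense latent distillation against a frozen teacher, can be sketched as below. The zoom range and the plain MSE loss are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def sample_random_glimpse(rng, scene_hw, min_zoom=1.0, max_zoom=4.0):
    """Policy-agnostic pretraining glimpse: uniform random location and zoom.

    Returns (x, y, width, height) of a crop that lies inside the scene.
    """
    h, w = scene_hw
    zoom = rng.uniform(min_zoom, max_zoom)
    gh, gw = max(1, int(h / zoom)), max(1, int(w / zoom))
    y = int(rng.integers(0, h - gh + 1))
    x = int(rng.integers(0, w - gw + 1))
    return x, y, gw, gh

def dense_distill_loss(canvas_decoded, teacher_feats):
    """Match the student's full-scene dense prediction (decoded from the
    canvas) to the passive teacher's dense feature map (e.g. DINOv3 output).
    A plain mean-squared error is assumed here for illustration."""
    return float(np.mean((canvas_decoded - teacher_feats) ** 2))
```

Because the glimpse sampler, not a learned policy, drives pretraining, nothing in the learned weights is tied to any particular looking strategy, which is what the paper's "policy-agnostic" framing refers to.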
After pretraining, the model was evaluated without any task-specific fine-tuning.
- ADE20K semantic segmentation, mIoU (1 glimpse): 38.5% vs. 27.6% for the prior best active model, a +10.9-point absolute improvement achieved with 19.5x fewer inference FLOPs
- ADE20K semantic segmentation, mIoU (multiple glimpses): 45.9%, outperforming a FLOP-matched DINOv3 teacher model
- ImageNet-1K classification, top-1 accuracy: 81.2%, using frozen linear probes on the canvas features

The results show that with just a single glimpse, CanViT significantly outperforms the previous best active model. More importantly, as it receives more glimpses, its performance climbs to 45.9% mIoU, demonstrating its ability to integrate information over time and to approach the performance of passive models that see the entire scene at once, with a fundamentally more efficient perceptual strategy.
How It Works: Training and Inference
The pretraining pipeline is key to CanViT's generalization. By using a randomized glimpse policy during training—where the location, zoom, and number of glimpses are all varied—the model learns to be agnostic to any specific policy for selecting glimpses. This makes it a true foundation model: it can later be paired with a learned policy (e.g., a reinforcement learning agent) or a fixed heuristic for downstream tasks.
During sequential inference, the process is as follows:
- A new glimpse (image patch) is taken from the scene at a specific (x, y) coordinate and scale.
- The glimpse is processed by the ViT backbone with scene-relative RoPE.
- The resulting glimpse features attend to the current canvas state via Canvas Attention.
- The canvas is updated with the new information.
- A task-specific head (e.g., for segmentation or classification) can read from the canvas at any time to produce an output.
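The loop above can be sketched end to end with toy stand-ins for the learned modules (every callable here is a placeholder: the real backbone, Canvas Attention, and head are neural networks, and a real policy would decide where to look from the canvas state).

```python
import numpy as np

def active_inference(scene, policy, backbone, update_canvas, head, canvas, steps):
    """Sequential glimpse loop; step numbers mirror the list above."""
    for _ in range(steps):
        x, y, size = policy(canvas, scene.shape)     # 1. pick the next glimpse
        glimpse = scene[y:y + size, x:x + size]      #    crop at that location
        feats = backbone(glimpse, x, y)              # 2. ViT + scene-relative RoPE
        canvas = update_canvas(canvas, feats, x, y)  # 3-4. Canvas Attention update
    return head(canvas)                              # 5. read out at any time

# toy stand-ins so the loop runs end to end
rng = np.random.default_rng(0)
scene = rng.normal(size=(64, 64))
policy = lambda canvas, shape: (int(rng.integers(0, 48)),
                                int(rng.integers(0, 48)), 16)  # random looking
backbone = lambda g, x, y: g.mean()                  # pretend glimpse feature
def update_canvas(canvas, feats, x, y):
    canvas = canvas.copy()
    canvas[y // 16, x // 16] = feats                 # write into a coarse cell
    return canvas
head = lambda canvas: float(np.abs(canvas).sum())    # pretend task readout

result = active_inference(scene, policy, backbone, update_canvas, head,
                          np.zeros((4, 4)), steps=8)
```

Note that the canvas is the only state carried between steps, so each iteration's cost depends on the glimpse size, not on how much of the scene has already been seen.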
This design eliminates the need for costly self-attention over the entire canvas history, enabling low-latency updates and scalability to very large scenes.
Why It Matters: A New Axis for Vision Research
CanViT represents a concrete step toward making active vision a practical and scalable paradigm. For years, active vision has been confined to narrow tasks or small-scale simulations. This work demonstrates that with the right architecture and large-scale, self-supervised pretraining, active models can perform competitively on standard, challenging benchmarks like ADE20K.
The implications are significant for applications with tight compute, latency, or bandwidth budgets, such as robotics, augmented reality, and mobile vision. Instead of processing megapixel images frame-by-frame, a system could use an active vision model to intelligently sample and integrate visual information over time, drastically reducing compute load.
agentic.news Analysis
This paper, posted to arXiv on March 23, 2026, arrives amidst a consistent stream of foundational AI research published on the platform. In the past week alone, arXiv has hosted studies on topics ranging from mitigating overrefusal in LLMs to new frameworks for sequential recommendation, as covered in our recent articles on PFSR and MI-DPG. The introduction of CanViT as an "Active-Vision Foundation Model" directly contributes to another major trend visible in the arXiv corpus: the development of agentic AI systems. The knowledge graph shows numerous connections between arXiv publications and technologies like reinforcement learning and AI agents. CanViT provides a critical perceptual substrate for such agents, moving beyond passive scene understanding to an active, sequential, and efficient form of perception that is more aligned with how embodied agents operate in the world.
Technically, the choice of DINOv3 as the teacher model is a savvy one. As noted in our knowledge graph, DINOv3 has been mentioned in several prior articles, establishing itself as a robust, self-supervised vision model capable of producing high-quality dense features. CanViT's distillation approach effectively transfers the knowledge of this powerful passive model into an active framework. The reported training time of 166 hours on a single H100 for 1 billion glimpses also sets a new scale benchmark for active vision, an order of magnitude beyond previous efforts and the kind of scale that a foundation-model claim requires.
The success of CanViT's decoupled architecture—separating a lightweight, glimpse-processing backbone from a persistent canvas memory—may inspire similar designs in other sequential processing domains. It echoes architectural principles seen in some memory-augmented language models but applies them to the spatially-grounded problem of vision. The next logical step, hinted at by the authors, is to combine CanViT with a learned policy network to create fully autonomous active perception systems, a direction deeply connected to the reinforcement learning and AI agent research frequently published alongside it on arXiv.
Frequently Asked Questions
What is an Active-Vision Foundation Model (AVFM)?
An Active-Vision Foundation Model is a general-purpose neural network architecture designed for active computer vision. Unlike traditional models that process a full, static image in one pass, an AVFM perceives a scene through a sequence of localized, high-resolution "glimpses." It is "foundation" because it is pretrained on large-scale data in a task-agnostic way, allowing it to be adapted for various downstream applications like segmentation or classification, and "policy-agnostic" because it isn't tied to one specific strategy for choosing where to look next.
How is CanViT more efficient than a standard Vision Transformer?
CanViT achieves efficiency through sequential processing and architectural choices. Instead of applying self-attention to a full image's worth of patches (which scales quadratically), it processes only a small glimpse at each step. Its "canvas" memory does not use self-attention, avoiding costly computations over a large latent space. The paper reports that for a single glimpse, CanViT uses 19.5x fewer inference FLOPs than the prior best active model while achieving better accuracy.
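A back-of-the-envelope calculation illustrates the quadratic-attention part of this saving. The image and glimpse sizes and the 14-pixel patch size below are illustrative assumptions, not the paper's configuration; the reported 19.5x figure also reflects other architectural savings.

```python
def self_attention_flops(n_tokens, dim):
    # rough cost of QK^T plus attn @ V: ~2 * n^2 * d multiply-adds
    return 2 * n_tokens ** 2 * dim

full_tokens = (518 // 14) ** 2     # a 518x518 image with 14px patches: 1369 tokens
glimpse_tokens = (224 // 14) ** 2  # a single 224x224 glimpse: 256 tokens
ratio = (self_attention_flops(full_tokens, 768)
         / self_attention_flops(glimpse_tokens, 768))
print(round(ratio, 1))  # ~28.6x cheaper per attention layer, from n^2 scaling
```

Because the cost grows with the square of the token count, even a modest reduction in tokens per step compounds into a large per-layer saving.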
What was CanViT trained on?
The CanViT-B model was pretrained from scratch on 13.2 million images from the ImageNet-21k dataset. From these images, the training pipeline generated 1 billion random glimpses. The training objective was not based on human labels but on reconstructing the dense feature maps produced by the DINOv3 vision model for the entire scene, using only the sequence of glimpses.
Can CanViT be used for robotics or real-time applications?
The architecture is designed with such applications in mind. Its low-latency sequential update (due to the lack of canvas self-attention) and scalable design make it a promising candidate for robotics, where an agent must build a scene understanding over time while controlling where it looks. However, the current work focuses on offline evaluation on standard datasets; integrating it with a real-time control policy and sensorimotor loop remains future work.