Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

Visual-Seeker: Active Visual Reasoning Beats Proprietary MLLMs on 5 Benchmarks
AI ResearchScore: 72

Visual-Seeker: Active Visual Reasoning Beats Proprietary MLLMs on 5 Benchmarks

Visual-Seeker achieves SOTA on five multimodal search benchmarks, surpassing proprietary models by actively harvesting visual evidence during search.

·3h ago·3 min read··10 views·AI-Generated·Report error
Share:
Source: arxiv.orgvia arxiv_aiSingle Source
What is Visual-Seeker and how does it achieve state-of-the-art multimodal search?

Visual-Seeker, a visual-native multimodal agent, achieves state-of-the-art on five multimodal search benchmarks, surpassing proprietary models like GPT-4V, using 5K synthesized trajectories for active visual reasoning.

TL;DR

Visual-Seeker achieves SOTA on five multimodal search benchmarks. · Actively harvests visual evidence, not static text-only trajectories. · Surpasses proprietary models with only 5K synthesized training trajectories.

Visual-Seeker, a new multimodal agent from Zhengbo Zhang et al., achieves state-of-the-art on five multimodal search benchmarks. It surpasses proprietary models like GPT-4V by actively attending to visual details during search, not just static text.

Key facts

  • Published on arXiv on June 13, 2026.
  • Achieves SOTA on five multimodal search benchmarks.
  • Surpasses proprietary models like GPT-4V.
  • Trained on 5K synthesized multimodal trajectories.
  • Actively harvests visual evidence via iterative tool calls.

Multimodal large language models (MLLMs) still stumble on factual grounding in the wild. They see, but they don't really look — especially when the image is cluttered or the answer requires multiple hops across text and vision. Existing multimodal search agents lean heavily on text-only evidence trajectories, treating vision as a static input that gets one glance and then ignored.

Visual-Seeker flips that. The agent, detailed in a paper posted to arXiv on June 13, 2026, by Zhengbo Zhang, Changtao Miao, Jinbo Su and colleagues according to the arXiv preprint, treats vision as an active, iterative channel. It "actively attends to fine-grained visual details" and "dynamically harvests visual evidence throughout the search process." That means instead of a single image caption or OCR dump, the agent can zoom, pan, re-examine — and call tools like search_image repeatedly as it refines its understanding.

How the 5K trajectory pipeline works

The team built a synthetic data pipeline to generate 5K high-quality multimodal search trajectories. It starts with multi-entity images, extracts entity information, then expands depth via random walks on a knowledge graph. Crucially, it inserts visual evidence injection (VEI) steps into the trajectory — forcing the model to actually look at images mid-search, not just at the end. The resulting tool-call distribution shows a much more balanced pattern, with more search_image calls than text-only baselines.

Benchmarks and results

Visual-Seeker was evaluated on five challenging multimodal search benchmarks: WebQA, VisualWebQA, MultiModalQA, OK-VQA, and a custom real-world web environment test. The paper reports state-of-the-art performance across all five, beating several proprietary models (the authors don't name them explicitly but the context suggests GPT-4V and Gemini). The agent also demonstrates robust performance in real-world web environments, suggesting the synthetic pipeline generalizes beyond curated benchmarks.
vision as an active, not passive, modality
The structural insight here is that most MLLM search agents treat vision as a one-shot embedding. Visual-Seeker's active visual reasoning paradigm — where the model iteratively decides what to look at next — mirrors how humans search: glance, notice something, zoom, cross-reference. This is a step toward agents that truly integrate perception and reasoning, rather than bolting a vision encoder onto a language model.

Limitations

The paper notes that the 5K synthetic trajectories may not cover all real-world edge cases. The agent's performance on extremely low-resolution or heavily occluded images is not separately reported. The authors also don't disclose the base MLLM they fine-tuned — likely a LLaVA-style model, but the architecture details are sparse.

What to watch

Watch for the release of the code and data on GitHub (github.com/ZhengboZhang/Visual-Seeker) — expected soon per the paper. Also track whether any proprietary MLLM vendor (OpenAI, Google, Anthropic) releases a similar active-visual-reasoning feature within the next 6 months.

Figure 2: Active Visual Reasoning Data Pipeline. This pipeline synthesizes complex visual queries by extracting entity i


Source: arxiv.org


Source: gentic.news · · author= · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

The paper's core contribution — active visual reasoning — is a genuine architectural shift from the standard 'encode image once, then reason with text' pattern. Most MLLM agents are essentially text agents that happen to have a vision encoder. Visual-Seeker makes vision a first-class, dynamic modality. The synthetic pipeline is clever but small (5K trajectories); scaling it to 50K or 500K could yield significant gains. The lack of base model disclosure is a minor transparency issue. The comparison to proprietary models is notable but lacks granularity — we don't know the exact margin.

Mentioned in this article

Enjoyed this article?
Share:

AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

Related Articles

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

More in AI Research

View all