Visual-Seeker, a new multimodal agent from Zhengbo Zhang et al., achieves state-of-the-art results on five multimodal search benchmarks. It reportedly surpasses several proprietary models, which the paper does not name, by actively attending to visual details during search rather than treating the image as a static input.
Key facts
- Published on arXiv on June 13, 2026.
- Achieves SOTA on five multimodal search benchmarks.
- Surpasses several proprietary models (not named in the paper; context suggests GPT-4V and Gemini).
- Trained on 5K synthesized multimodal trajectories.
- Actively harvests visual evidence via iterative tool calls.
Multimodal large language models (MLLMs) still stumble on factual grounding in the wild. They see, but they don't really look, especially when the image is cluttered or the answer requires multiple hops across text and vision. Existing multimodal search agents lean heavily on text-only evidence trajectories, treating vision as a static input that gets one glance and is then ignored.
Visual-Seeker flips that. According to the arXiv preprint, posted on June 13, 2026, by Zhengbo Zhang, Changtao Miao, Jinbo Su and colleagues, the agent treats vision as an active, iterative channel. It "actively attends to fine-grained visual details" and "dynamically harvests visual evidence throughout the search process." Instead of relying on a single image caption or OCR dump, the agent can zoom, pan and re-examine the image, calling tools like search_image repeatedly as it refines its understanding.
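To make the idea of an iterative visual channel concrete, here is a minimal Python sketch of a loop that keeps deciding what to look at next. It is an illustration only, not the paper's implementation: the search_image function is a stand-in that returns a canned string, and SearchState, pick_next_region and run_agent are hypothetical names. In a real agent, the MLLM itself would play the role of pick_next_region.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Question plus the visual evidence gathered so far."""
    question: str
    evidence: list = field(default_factory=list)


def search_image(region):
    """Stand-in for an image search tool (assumption: returns a canned string)."""
    return f"visual evidence for {region}"


def pick_next_region(state, candidate_regions):
    """Hypothetical policy: choose the first region not examined yet."""
    seen = {region for region, _ in state.evidence}
    for region in candidate_regions:
        if region not in seen:
            return region
    return None  # nothing left to look at


def run_agent(question, candidate_regions, max_steps=5):
    """Repeatedly pick a region and call the image tool, up to max_steps times."""
    state = SearchState(question)
    for _ in range(max_steps):
        region = pick_next_region(state, candidate_regions)
        if region is None:
            break
        state.evidence.append((region, search_image(region)))
    return state


if __name__ == "__main__":
    result = run_agent("Who designed this tower?", ["tower", "plaque", "skyline"])
    for region, observation in result.evidence:
        print(region, "->", observation)
```

The point of the sketch is the control flow: the image is consulted several times during the search, and each look depends on what has already been seen.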
How the 5K trajectory pipeline works
The team built a synthetic data pipeline to generate 5K high-quality multimodal search trajectories. It starts with multi-entity images, extracts entity information, then expands depth via random walks on a knowledge graph. Crucially, it inserts visual evidence injection (VEI) steps into the trajectory — forcing the model to actually look at images mid-search, not just at the end. The resulting tool-call distribution shows a much more balanced pattern, with more search_image calls than text-only baselines.
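The pipeline described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' code: the knowledge graph is a small hand-written dictionary, search_text is a hypothetical name for a text search tool (only search_image is named in the source), and the vei_every parameter is an invented knob for how often an image look-up is injected.

```python
import random

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
TOY_GRAPH = {
    "Eiffel Tower": [("located_in", "Paris"), ("designed_by", "Gustave Eiffel")],
    "Paris": [("capital_of", "France")],
    "Gustave Eiffel": [("born_in", "Dijon")],
    "France": [],
    "Dijon": [],
}


def random_walk(graph, start, max_hops, rng):
    """Follow up to max_hops random edges from start, stopping at a dead end."""
    path, node = [], start
    for _ in range(max_hops):
        edges = graph.get(node, [])
        if not edges:
            break
        relation, neighbor = rng.choice(edges)
        path.append((node, relation, neighbor))
        node = neighbor
    return path


def build_trajectory(graph, image_entities, max_hops, vei_every, rng):
    """Turn a walk into tool-call steps, injecting image look-ups mid-search."""
    start = rng.choice(image_entities)
    # First step: look at an entity extracted from the multi-entity image.
    steps = [{"tool": "search_image", "query": start}]
    walk = random_walk(graph, start, max_hops, rng)
    for hop, (head, relation, tail) in enumerate(walk, start=1):
        steps.append({"tool": "search_text", "query": f"{head} {relation}"})
        if hop % vei_every == 0:
            # Visual evidence injection: look again at an image mid-trajectory.
            steps.append({"tool": "search_image", "query": tail})
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    for step in build_trajectory(TOY_GRAPH, ["Eiffel Tower"], 3, 1, rng):
        print(step)
```

With vei_every set to 1, every text hop is followed by an image look-up, which is one simple way to get the more balanced tool-call mix the article describes. A larger value would produce trajectories closer to the text-heavy baselines.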
Benchmarks and results
Visual-Seeker was evaluated on five challenging multimodal search benchmarks: WebQA, VisualWebQA, MultiModalQA, OK-VQA, and a custom real-world web environment test. The paper reports state-of-the-art performance across all five, beating several proprietary models (the authors don't name them explicitly but the context suggests GPT-4V and Gemini). The agent also demonstrates robust performance in real-world web environments, suggesting the synthetic pipeline generalizes beyond curated benchmarks.
Vision as an active, not passive, modality
The structural insight here is that most MLLM search agents treat vision as a one-shot embedding. Visual-Seeker's active visual reasoning paradigm — where the model iteratively decides what to look at next — mirrors how humans search: glance, notice something, zoom, cross-reference. This is a step toward agents that truly integrate perception and reasoning, rather than bolting a vision encoder onto a language model.
Limitations
The paper notes that the 5K synthetic trajectories may not cover all real-world edge cases. The agent's performance on extremely low-resolution or heavily occluded images is not separately reported. The authors also don't disclose the base MLLM they fine-tuned. A LLaVA-style model seems likely, but that is a guess, and the architecture details are sparse.
What to watch
Watch for the release of the code and data on GitHub (github.com/ZhengboZhang/Visual-Seeker), which the paper says is expected soon. Also track whether any proprietary MLLM vendor (OpenAI, Google, Anthropic) releases a similar active-visual-reasoning feature within the next six months.

Source: arxiv.org









