How does SWE-Explore measure code search accuracy?

It uses at least two successful solution runs from models like GPT-5.4 to establish ground-truth relevant code sections, then checks if agents find those exact lines.

Why do AI agents fail at line-level search?

The benchmark shows that agents reliably find the correct file but lack the precision to identify the exact lines that need to change, regardless of model strength.

A line graph titled SWE-Explore showing low coverage rates around 14-19% for Claude Code, Codex 5.3, and OpenHands…

SWE-Explore: AI coding agents find files but miss 81-86% of critical lines

SWE-Explore benchmark shows Claude Code, Codex cover only 14-19% of critical lines despite finding the right file. Model strength doesn't fix the structural weakness.

Ala SMITH & AI Research Desk · Jun 14, 2026 · 3 min read · AI-Generated

Source: the-decoder.com via the_decoder · Widely Reported

What does the SWE-Explore benchmark reveal about AI coding agents?

SWE-Explore, a new benchmark from Shanghai Jiao Tong University, tests AI coding agents on code search alone. Claude Code, Codex, and OpenHands find the right file but cover only 14-19% of critical lines, revealing a hidden weakness in bug repair.

TL;DR

SWE-Explore benchmark tests code search separately from repair. · Claude Code, Codex cover only 14-19% of relevant lines. · File-level hit rates stay high; line-level accuracy collapses.

SWE-Explore tests 848 bug-fixing tasks across 203 open-source projects. Claude Code, Codex 5.3, and OpenHands all find the right file but cover only 14-19% of critical lines.

Key facts

SWE-Explore: 848 problems from 203 open-source projects.
Claude Code, Codex cover only 14-19% of critical lines.
Python dominates with 547 of 848 tasks.
File hit rates stay high; line-level accuracy collapses.
Six different models tested; pattern holds across all.

An international research team led by Shanghai Jiao Tong University released SWE-Explore, a benchmark that isolates code search from the actual repair phase. The core finding, according to the source: AI coding agents reliably identify the correct source file, but their line-level coverage collapses to 14-19% of the lines that matter.

The benchmark uses 848 problems from 203 open-source projects across 10 languages (Python dominates with 547 tasks, followed by Go, JavaScript, and Rust). For each problem, at least two successful solution runs from models like GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, or Kimi K2.6 establish the ground-truth set of relevant code sections. Passages that multiple independent solution paths converge on are marked as critical context.
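The consensus step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the article does not give the exact agreement rule, so the `min_runs=2` threshold, the `(file, line)` data layout, and the `consensus_lines` helper are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Set, Tuple

# A line reference is (file path, line number). This layout and the helper
# name below are hypothetical, chosen only for illustration.
LineRef = Tuple[str, int]


def consensus_lines(runs: List[Set[LineRef]], min_runs: int = 2) -> Set[LineRef]:
    """Return the lines that at least `min_runs` successful runs touched."""
    # Each run is a set, so a line counts at most once per run.
    counts = Counter(ref for run in runs for ref in run)
    return {ref for ref, n in counts.items() if n >= min_runs}


# Toy data: three successful solution runs and the lines each one read.
runs = [
    {("calc/ops.py", 10), ("calc/ops.py", 11), ("calc/util.py", 3)},
    {("calc/ops.py", 10), ("calc/ops.py", 11), ("docs/README.md", 1)},
    {("calc/ops.py", 11), ("calc/util.py", 3)},
]

critical = consensus_lines(runs)
print(sorted(critical))
```

In this toy data the README line appears in only one run, so it drops out, while the three source lines that several runs agree on remain as critical context.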

File-level success, line-level failure

Traditional keyword search barely beats chance—the authors show a bug description like "RuntimeWarning on Overflow" matches templates and docs more often than actual source code. AI agents pull ahead by searching step-by-step rather than sorting all hits at once.

But the moment evaluation zooms from file-level to line-level, the systems fall apart. General coding agents (Claude Code, Codex, OpenHands) plus four research systems designed specifically for code search all land in the same band: 14-19% line coverage. The various agent architectures "land strikingly close to each other," per the paper.
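The gap between the two granularities is easier to see with a small calculation. The sketch below assumes simple definitions (a file hit means the agent opened any file containing critical lines; line coverage is the share of critical lines the agent surfaced). The article does not spell out the paper's exact formulas, and the function names and numbers are illustrative toy values, not benchmark results.

```python
from typing import Set, Tuple

# (file path, line number); a hypothetical layout for this example.
LineRef = Tuple[str, int]


def file_hit(found: Set[LineRef], critical: Set[LineRef]) -> bool:
    """True if the agent opened at least one file that holds critical lines."""
    return bool({f for f, _ in found} & {f for f, _ in critical})


def line_coverage(found: Set[LineRef], critical: Set[LineRef]) -> float:
    """Fraction of critical lines that the agent actually surfaced."""
    if not critical:
        return 0.0
    return len(found & critical) / len(critical)


# Toy case: 20 critical lines (100..119); the agent read lines 90..102
# of the same file, so it overlaps on only three of them.
critical = {("calc/ops.py", n) for n in range(100, 120)}
found = {("calc/ops.py", n) for n in range(90, 103)}

print(file_hit(found, critical))
print(f"{line_coverage(found, critical):.0%}")
```

Here the agent scores a file hit while covering only 3 of 20 critical lines, which is how a high file-level score and a low line-level score can coexist.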

Model strength doesn't fix it

The team ran the same agent architecture with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. GPT-family models lead, but the pattern holds: file hit rates stay high while line coverage remains low. Throwing a stronger language model at the problem doesn't close the gap.

Pipeline diagram of SWE-Explore showing benchmark construction on the left, from solved agent runs through read actions, line regions, and consensus…

This finding echoes a June 4 user report that Claude Code quality dropped after Opus 4.6, with roughly 25% of instructions missed, while the same user credited Codex 5.3 with 95% reliability. The SWE-Explore results suggest the weakness is structural: agents lack the ability to precisely locate the exact lines that need to change, regardless of model or architecture.

The benchmark exposes a blind spot in how AI coding is evaluated. Until now, the field judged agents by whether they fixed the bug or not. SWE-Explore shows that even successful fixes may rely on luck or over-broad context rather than precise understanding.

What to watch

Watch for follow-up work from the same team or competitors that attempts to improve line-level coverage. If Anthropic or OpenAI releases an agent that scores above 30% on SWE-Explore, it would signal a genuine architectural breakthrough rather than a model upgrade.

Side-by-side comparison showing a conventional benchmark on the left with its Explore, Patch, and Verify pipeline producing a single Resolve Rate…

Source: gentic.news · Jun 14, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The SWE-Explore benchmark is a necessary corrective to the field's fixation on end-to-end fix rates. The finding that all tested agents cluster in the 14-19% line-coverage band is the most striking result—it suggests the problem isn't model capability but the architecture of code search itself. Current agents use retrieval-augmented generation (RAG) or simple file-scoping heuristics that work well for coarse localization but fail at the line level. This aligns with the June 4 user reports of Claude Code quality dropping post-Opus 4.6: if the agent's search mechanism is the bottleneck, a smarter model can't compensate. The paper's methodology—using multiple successful solution runs to define ground truth—is clever but introduces a subtle bias: the ground truth is defined by models, not by human expert annotation. However, the manual review step mitigates this somewhat. The practical implication is clear: the next frontier for AI coding agents isn't better code generation but better code comprehension. Expect startups and labs to invest in hierarchical code representations, AST-aware search, or multi-step reasoning that drills from file to function to line.
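To make the "file to function to line" idea concrete, here is one way such a drill-down could look, using Python's standard `ast` module. It is only a sketch of the general technique the analysis anticipates, not the method of any agent tested in SWE-Explore; the sample source, the helper names, and the plain substring match are all assumptions for illustration.

```python
import ast
from typing import List, Tuple

# Hypothetical source file used only for this example.
SOURCE = '''\
def load(path):
    return open(path).read()


def safe_add(a, b):
    total = a + b
    if total > 2**31:
        raise OverflowError("overflow")
    return total
'''


def function_spans(source: str) -> List[Tuple[str, int, int]]:
    """List (name, first line, last line) for every function in the source."""
    tree = ast.parse(source)
    return [
        (node.name, node.lineno, node.end_lineno)  # end_lineno: Python 3.8+
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def lines_matching(source: str, keyword: str) -> List[Tuple[str, int]]:
    """Drill from function to line: return (function, line number) per match."""
    text = source.splitlines()
    hits = []
    for name, start, end in function_spans(source):
        for lineno in range(start, end + 1):
            if keyword in text[lineno - 1]:
                hits.append((name, lineno))
    return hits


print(function_spans(SOURCE))
print(lines_matching(SOURCE, "OverflowError"))
```

The syntax tree supplies function boundaries, so a match is reported together with its enclosing function and exact line number rather than only the file that contains it.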
