SWE-Explore tests 848 bug-fixing tasks across 203 open-source projects. Claude Code, Codex 5.3, and OpenHands all find the right file but cover only 14-19% of critical lines.
Key facts
- SWE-Explore: 848 problems from 203 open-source projects.
- Claude Code, Codex, and OpenHands cover only 14-19% of critical lines.
- Python dominates with 547 of 848 tasks.
- File hit rates stay high; line-level accuracy collapses.
- Six different models tested; pattern holds across all.
An international research team led by Shanghai Jiao Tong University released SWE-Explore, a benchmark that isolates code search from the actual repair phase. The core finding: AI coding agents reliably identify the correct source file, but their line-level coverage collapses to 14-19% of the lines that matter.
The benchmark uses 848 problems from 203 open-source projects across 10 languages (Python dominates with 547 tasks, followed by Go, JavaScript, and Rust). For each problem, at least two successful solution runs from models like GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, or Kimi K2.6 establish the ground-truth set of relevant code sections. Passages that multiple independent solution paths converge on are marked as critical context.
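As a rough illustration of the convergence idea, the sketch below marks a location as critical when at least two solution runs touched it. The data layout (sets of `(file, line)` pairs), the function name `critical_context`, the `min_runs` threshold, and the sample paths are all assumptions made for illustration, not the paper's actual procedure.

```python
from collections import Counter


def critical_context(runs, min_runs=2):
    """Return the (file, line) pairs that appear in at least `min_runs` runs.

    `runs` is a list of sets of (file_path, line_number) pairs, one set per
    successful solution run. Layout and threshold are illustrative assumptions.
    """
    # Count in how many runs each location appears (each run counts once).
    counts = Counter(loc for run in runs for loc in set(run))
    return {loc for loc, n in counts.items() if n >= min_runs}


# Hypothetical data: two runs that read overlapping parts of one file.
run_a = {("core/math.py", 10), ("core/math.py", 11), ("docs/index.md", 3)}
run_b = {("core/math.py", 11), ("core/math.py", 12)}

critical = critical_context([run_a, run_b])
print(sorted(critical))  # [('core/math.py', 11)]
```

Only the location both runs share survives; anything a single run happened to read is treated as incidental under this assumed rule.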
Key Takeaways
- The SWE-Explore benchmark shows Claude Code and Codex cover only 14-19% of critical lines despite finding the right file.
- Model strength doesn't fix the structural weakness.
File-level success, line-level failure
Traditional keyword search barely beats chance: the authors show that a bug description like "RuntimeWarning on Overflow" matches templates and docs more often than actual source code. AI agents pull ahead by searching step by step rather than sorting all hits at once.
But the moment evaluation zooms from file-level to line-level, the systems fall apart. General coding agents (Claude Code, Codex, OpenHands) plus four research systems designed specifically for code search all land in the same band: 14-19% line coverage. The various agent architectures "land strikingly close to each other," per the paper.
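The gap between the two granularities can be made concrete with a small sketch. The definitions used here (a file hit means the agent opened at least one file containing a critical line; line coverage is the share of critical lines the agent actually viewed) are one plausible reading of the article, not the benchmark's published formulas, and every name and number in the example is hypothetical.

```python
def file_hit(explored, critical):
    """True if the agent opened any file that contains a critical line."""
    explored_files = {path for path, _ in explored}
    critical_files = {path for path, _ in critical}
    return bool(explored_files & critical_files)


def line_coverage(explored, critical):
    """Fraction of critical (file, line) pairs that the agent viewed."""
    if not critical:
        return 0.0
    return len(explored & critical) / len(critical)


# Hypothetical task: 20 critical lines (100..119) in one file.
critical = {("src/overflow.py", n) for n in range(100, 120)}
# Hypothetical agent trace: it opened the right file but read lines 1..103.
explored = {("src/overflow.py", n) for n in range(1, 104)}

print(file_hit(explored, critical))       # True
print(line_coverage(explored, critical))  # 0.2
```

In this made-up trace the agent scores a perfect file hit while seeing only 4 of the 20 critical lines, which is the shape of result the article describes.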
Model strength doesn't fix it
The team ran the same agent architecture with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. GPT-family models lead, but the pattern holds: file hit rates stay high while line coverage remains low. Throwing a stronger language model at the problem doesn't close the gap.

This finding echoes a June 4 report in which a user said Claude Code quality dropped after Opus 4.6, with roughly 25% of instructions missed, while the same user credited Codex 5.3 with 95% reliability. The SWE-Explore results suggest the weakness is structural: agents cannot pinpoint the exact lines that need to change, regardless of model or architecture.
The benchmark exposes a blind spot in how AI coding is evaluated. Until now, the field judged agents by whether they fixed the bug or not. SWE-Explore shows that even successful fixes may rely on luck or over-broad context rather than precise understanding.
What to watch
Watch for follow-up work from the same team or competitors that attempts to improve line-level coverage. If Anthropic or OpenAI releases an agent that scores above 30% on SWE-Explore, it would signal a genuine architectural breakthrough rather than a model upgrade.

Source: the-decoder.com








