A new paper (arXiv 2605.15184) shows grep-style search beats vector retrieval across every harness-model pair on LongMemEval. The finding undermines the default assumption that every serious agent stack needs embeddings.
Key facts
- Paper: arXiv 2605.15184
- Benchmark: LongMemEval
- Inline grep beats inline vector across all harness-model pairs.
- Grep wins on evidence-location tasks (names, dates, file paths).
- Retrieval method performance depends on agent harness design.
The paper, "Is Grep All You Need? How Agent Harnesses Reshape Agentic Search," compares grep and vector retrieval on LongMemEval, where agents recover facts from long conversation histories full of distractors. [According to @rohanpaul_ai's summary] Inline grep beats inline vector across every harness-model pair in their main experiment, sometimes by wide margins.
The surprising result is not that grep is powerful, but that agent design makes it powerful. The paper says not that grep beats vectors, but that agents fail or win through their harness.
Why Grep Wins for Evidence-Location

When the answer is anchored in literal evidence—names, dates, file paths, function names, error strings, user preferences—grep gives the model a clean mechanical advantage. Embeddings are built to tolerate paraphrase, but tolerance has a cost: they can pull in semantically nearby clutter, especially when a short agent query is vague. Grep has the opposite failure mode: dumb, cheap, and narrow, but when the agent knows the right string to hunt for, dumb becomes a feature.
The unique take here is that retrieval is not a component you can benchmark in isolation. The same search method behaves differently depending on whether results are injected inline, written to files, routed through a CLI, or wrapped in a custom agent loop.
What This Means for Agent Architecture
For coding agents, a surprising amount of work is evidence-location: find the symbol, trace the call, inspect the diff, read the failing test, recover the exact line. Vectors still matter at scale and for fuzzy conceptual search, but this paper weakens the lazy default that every serious agent stack begins with embeddings. Sometimes the upgrade is not a smarter index—it is giving the model primitive tools, clean files, disciplined context, and a harness that lets exact search do exact work.
What to watch
Watch for follow-up ablation studies on agent harness design versus retrieval method, and whether companies like LangChain or Anthropic incorporate exact-search primitives into their agent frameworks in H2 2026.









