Anthropic research shows AI agents returned 5–106 Ebola sequences instead of the expected 266, altering scientific conclusions. The study, posted on X by @rohanpaul_ai, highlights how biology databases break current agent workflows.
Key facts
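- Claude Sonnet 4 returned between 5 and 106 sequences against an expected 266.
- In one run, the missing sequences shifted the inferred outbreak origin from early 2014 to 1922.
- A repeatable retrieval tool made agents far more accurate and much more consistent.
- Sequence counts varied roughly 20x (5 to 106) across runs of the same, unchanged prompt.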
Anthropic's latest research, shared on X by @rohanpaul_ai, reveals a critical blind spot in AI agent reliability: biology data retrieval. In one Ebola sequence task, Claude Sonnet 4 returned 106 sequences in one run, then 15, then 5, while the expected answer was 266, according to @rohanpaul_ai. Those missing sequences did not just make the dataset messy; they changed the scientific story built on top of it.
One bad retrieval made the outbreak look like it traced back to 1922, instead of the manually curated result pointing to early 2014. The agents often understood what they were being asked, but their answers varied a lot because they had to fight through scattered databases, hidden website rules, and fragile scripts. The biology databases were too hard to use reliably through current AI tools.
The key finding is that adding a repeatable retrieval tool made agents far more accurate and much more consistent. This suggests the problem is not agent intelligence but the lack of tooling that can handle the messy, non-standardized interfaces of scientific databases.
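The source does not describe how the repeatable retrieval tool works, so the following is only a minimal sketch of what "repeatable" could mean in practice: page through a source in a fixed order, deduplicate and sort the results, and fail loudly if the count falls short of what the source reports. The names `fetch_all`, `fetch_page`, `IncompleteRetrievalError`, and the in-memory `FAKE_DB` are hypothetical stand-ins, not part of Anthropic's tool or any real database client.

```python
from typing import Callable, List, Sequence


class IncompleteRetrievalError(RuntimeError):
    """Raised when a retrieval returns fewer records than the source reports."""


def fetch_all(
    fetch_page: Callable[[int, int], Sequence[str]],
    total_count: int,
    page_size: int = 50,
) -> List[str]:
    """Page through a source deterministically and verify completeness.

    fetch_page(offset, limit) is a hypothetical callable standing in for
    whatever database client is used; it is not a real library API.
    """
    records: List[str] = []
    offset = 0
    while offset < total_count:
        page = fetch_page(offset, page_size)
        if not page:
            break  # source ran dry early; the check below reports it
        records.extend(page)
        offset += len(page)

    # Deduplicate and sort so repeated runs yield identical output.
    unique = sorted(set(records))
    if len(unique) != total_count:
        raise IncompleteRetrievalError(
            f"expected {total_count} records, got {len(unique)}"
        )
    return unique


# Stand-in data source: 266 made-up record IDs (illustrative only).
FAKE_DB = [f"SEQ{i:04d}" for i in range(266)]


def fake_fetch_page(offset: int, limit: int) -> List[str]:
    return FAKE_DB[offset : offset + limit]


if __name__ == "__main__":
    ids = fetch_all(fake_fetch_page, total_count=len(FAKE_DB))
    print(len(ids))  # 266
```

The design choice worth noting is the explicit completeness check: a run that silently returns 106 of 266 records raises an error instead of handing a partial dataset to downstream analysis.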
Why this matters
This is not a niche biology bug. It reveals a structural failure in how AI agents interact with web-scale data. If a top-tier model like Claude Sonnet 4 can vary by roughly 20x on a simple sequence count, then any agent pipeline that depends on database queries, from drug discovery to climate modeling, is vulnerable to silent, catastrophic variance. The 1922 versus 2014 origin shift is a concrete example of how retrieval noise rewrites scientific narratives.
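The "20x" figure follows directly from the run counts reported above. The short calculation below uses only those numbers (106, 15, and 5 returned against 266 expected); the variable names are illustrative, and "coverage" here simply means returned count divided by expected count, assuming every returned sequence belongs to the expected set.

```python
EXPECTED = 266          # manually curated sequence count from the report
RUNS = [106, 15, 5]     # sequences returned in three runs of the same prompt

# Ratio between the largest and smallest run.
spread = max(RUNS) / min(RUNS)

# Fraction of the expected dataset each run recovered.
coverage = [n / EXPECTED for n in RUNS]

print(f"spread: {spread:.1f}x")  # spread: 21.2x
for n, c in zip(RUNS, coverage):
    print(f"{n} of {EXPECTED}: {c:.1%}")
# 106 of 266: 39.8%
# 15 of 266: 5.6%
# 5 of 266: 1.9%
```

Even the best run recovered under 40% of the expected sequences, so the run-to-run spread understates how much data was missing in every case.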
The research echoes broader agent reliability concerns. In coding benchmarks like SWE-bench, agents have shown high pass rates, but those benchmarks use clean, version-controlled repositories. Biology databases are the opposite: inconsistent APIs, hidden rate limits, and undocumented schema changes. The gap between clean benchmarks and real-world data is the story here.
Key Takeaways
- Anthropic research shows Claude Sonnet 4 returning between 5 and 106 Ebola sequences instead of 266, which in one run shifted the inferred outbreak origin from early 2014 to 1922.
- A repeatable retrieval tool made agents far more accurate and much more consistent.
What to watch
Watch for Anthropic to release a tooling patch or API wrapper specifically for biology databases. If they open-source the repeatable retrieval tool, it could become a standard for agent-database interaction. Also monitor whether other labs (OpenAI, Google DeepMind) replicate these findings with their own models.







