gentic.news — AI News Intelligence Platform

A benchmark chart showing AI model scores below 30% on scientific coding tasks, with a laptop and research papers in…
AI Research · Score: 95

SciCode: Epoch AI Launches Benchmark Measuring AI Research Ability

Epoch AI launched the SciCode benchmark, which tests LLMs on real research coding tasks. Top models score below 30%, exposing a gap between coding benchmarks and scientific ability.

22h ago · 4 min read · 16 views · AI-Generated
Source: news.google.com via epoch_ai_gradient_updates_gn · Widely Reported
What is the SciCode benchmark from Epoch AI?

Epoch AI launched SciCode, a benchmark evaluating LLMs on real scientific research coding tasks, requiring multi-step reasoning and domain knowledge. Early results show top models scoring below 30%.

TL;DR

Epoch AI released the SciCode benchmark. · It tests LLMs on real research coding tasks. · It challenges models beyond standard coding.

Epoch AI launched SciCode, a benchmark for evaluating LLMs on real scientific research coding tasks. Early results show top models scoring below 30%, highlighting the gap between coding benchmarks and genuine research ability.

Key facts

  • Top LLMs score below 30% on the SciCode benchmark.
  • SciCode includes problems from physics, chemistry, and biology.
  • Epoch AI scales task difficulty by the number of reasoning steps required.
  • Benchmark aims to measure genuine scientific discovery ability.

Epoch AI has released SciCode, a new benchmark designed to test whether large language models can perform real scientific research coding. Unlike existing benchmarks that focus on algorithmic puzzles or software engineering tasks, SciCode requires models to solve problems drawn from actual research papers across physics, chemistry, and biology. The benchmark includes tasks such as implementing simulation code, analyzing experimental data, and reproducing key figures from published studies.

Early results reveal a significant gap between current LLM capabilities and the requirements of scientific research. Even the best-performing models, including GPT-5.5 and Gemini 3.5 Pro, scored below 30% on SciCode, according to Epoch AI's evaluation. This compares to scores above 80% on standard coding benchmarks like SWE-Bench and HumanEval, suggesting that existing evaluations overstate models' ability to contribute to scientific work.

Why SciCode matters

The benchmark addresses a growing tension in AI research: while LLMs are increasingly promoted as tools for scientific discovery, their evaluation has remained narrow. SciCode's design forces models to combine coding proficiency with domain knowledge and multi-step reasoning, mirroring the workflow of a research scientist. For example, one task requires implementing a Monte Carlo simulation from a condensed matter physics paper and reproducing a phase transition plot — a challenge that demands both physics understanding and coding skill.
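
To give a sense of what such a task involves, here is a minimal sketch of a Metropolis Monte Carlo simulation. It assumes the 2D Ising model as a stand-in, since the article does not name the paper or the physical model. The function name `metropolis_ising`, the lattice size, and the temperatures are illustrative choices and are not taken from SciCode.

```python
import math
import random


def metropolis_ising(size, temperature, sweeps, seed=0):
    """Estimate the mean absolute magnetization per spin of a 2D Ising lattice.

    Assumptions (illustrative, not from SciCode): square lattice, periodic
    boundaries, coupling J = 1, Boltzmann constant k_B = 1, sweeps >= 1.
    """
    rng = random.Random(seed)
    # Start from the fully ordered (all spins up) configuration.
    spins = [[1] * size for _ in range(size)]
    samples = []
    for sweep in range(sweeps):
        # One sweep = size * size single-spin update attempts.
        for _ in range(size * size):
            i = rng.randrange(size)
            j = rng.randrange(size)
            neighbors = (
                spins[(i + 1) % size][j]
                + spins[(i - 1) % size][j]
                + spins[i][(j + 1) % size]
                + spins[i][(j - 1) % size]
            )
            # Energy change if spin (i, j) is flipped.
            delta_e = 2 * spins[i][j] * neighbors
            # Metropolis rule: always accept downhill moves, otherwise
            # accept with probability exp(-delta_e / T).
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
                spins[i][j] = -spins[i][j]
        # Discard the first half of the run as equilibration.
        if sweep >= sweeps // 2:
            total = sum(sum(row) for row in spins)
            samples.append(abs(total) / (size * size))
    return sum(samples) / len(samples)


if __name__ == "__main__":
    for t in (1.0, 2.0, 2.5, 3.5, 5.0):
        print(t, metropolis_ising(8, t, 200))
```

Plotting the returned magnetization against temperature would give the kind of phase transition curve the task describes. An actual research task would go further, for example with finite-size analysis and error estimates, which is where the domain knowledge comes in.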

Epoch AI's approach also includes a novel difficulty scaling mechanism. Problems are categorized by the number of reasoning steps required, the level of domain knowledge needed, and the length of the code solution. This allows researchers to track progress across specific dimensions of scientific capability.
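
The idea of tracking progress along separate difficulty dimensions can be sketched as follows. This is a hypothetical illustration only: the `TaskResult` fields, the `pass_rate_by` helper, and the sample records are assumptions made for the example and do not reflect Epoch AI's actual schema, scoring code, or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One model attempt at one task (hypothetical schema)."""
    task_id: str
    reasoning_steps: int   # number of reasoning steps the task requires
    domain_level: str      # e.g. "undergraduate" or "graduate" (assumed labels)
    solution_lines: int    # length of the reference solution in lines
    passed: bool


def pass_rate_by(results, key):
    """Group results by key(result) and return the pass rate per group."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        bucket = key(result)
        totals[bucket] += 1
        passes[bucket] += int(result.passed)
    return {bucket: passes[bucket] / totals[bucket] for bucket in totals}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [
        TaskResult("t1", 2, "undergraduate", 40, True),
        TaskResult("t2", 2, "graduate", 55, False),
        TaskResult("t3", 5, "graduate", 180, False),
    ]
    print(pass_rate_by(demo, lambda r: r.reasoning_steps))
    print(pass_rate_by(demo, lambda r: r.domain_level))
```

Passing a different `key` function slices the same results by another dimension, which is what makes per-dimension progress tracking cheap once tasks carry this metadata.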

Implications for AI development

The low scores on SciCode have practical implications for AI adoption in research settings. Companies like Google and OpenAI have positioned their models as scientific assistants, but SciCode suggests that current systems remain unreliable for tasks requiring deep domain integration. According to Epoch AI, the benchmark is designed to evolve as models improve, with new problems added from recent publications to prevent saturation.

The benchmark also highlights a structural weakness in current LLM training: models are trained on vast amounts of code from platforms like GitHub, but scientific code is often idiosyncratic and poorly documented, and it requires an understanding of the underlying theory. SciCode's results suggest that scaling alone may not close this gap, and that targeted training on scientific workflows may be necessary.

Key Takeaways

  • Epoch AI launched the SciCode benchmark, which tests LLMs on real research coding tasks.
  • Top models score below 30%, exposing a gap between coding benchmarks and scientific ability.

What to watch

Watch for the first model to surpass 50% on SciCode, which would indicate a meaningful advance in AI's research capability. Epoch AI plans to update the benchmark with new problems quarterly from recent preprints.


[Updated 28 Jun via epoch_ai_gradient_updates_gn]

Epoch AI also unveiled MirrorCode, a benchmark that tests whether AI can rebuild entire software projects solely from observing program behavior, without access to source code. Early results show that models can reconstruct programs of up to 10,000 lines but struggle beyond that threshold [per Epoch AI]. MirrorCode complements SciCode by measuring AI's ability to reverse-engineer and replicate existing codebases, a skill critical for understanding legacy scientific software.
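
One way to picture "rebuilding from observed behavior" is a black-box comparison between a reference program and a reconstruction. The sketch below is an assumption about how such a check could look, not a description of how MirrorCode scores submissions; `behavior_match_rate` and the toy functions are hypothetical.

```python
def behavior_match_rate(reference, candidate, inputs):
    """Return the fraction of inputs on which two callables behave the same.

    "Same behavior" here means equal return values, or the same exception
    type being raised. Only observable behavior is compared; the source
    code of `reference` is never inspected.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    matches = 0
    for value in inputs:
        try:
            expected = reference(value)
        except Exception as exc:
            expected = type(exc)
        try:
            actual = candidate(value)
        except Exception as exc:
            actual = type(exc)
        if expected == actual:
            matches += 1
    return matches / len(inputs)


if __name__ == "__main__":
    def original(x):
        return x * x

    def rebuilt(x):
        return abs(x) * abs(x)

    print(behavior_match_rate(original, rebuilt, [-2, 0, 3]))
```

A check like this can only show agreement on the inputs that were tried, which hints at why reconstruction may get harder as programs grow: the space of behaviors to observe grows with them.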


Sources cited in this article

  1. Epoch AI's
  2. Epoch AI

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

SciCode represents a necessary correction to the narrative that LLMs are ready for scientific discovery. The benchmark's design, which requires domain knowledge, multi-step reasoning, and code implementation from papers, directly targets the gap between what models can do on curated coding problems and what they can contribute to actual research. The sub-30% scores are striking precisely because they come from the same models that dominate SWE-Bench and HumanEval.

Epoch AI's approach also exposes a structural issue in LLM training data. Scientific code is underrepresented in training corpora compared to general-purpose code, and the reasoning patterns required (understanding experimental design, interpreting results, debugging domain-specific errors) are not well captured by existing RLHF or instruction-tuning pipelines. This suggests that improving SciCode performance may require not just larger models or more data, but fundamentally different training objectives that incorporate scientific reasoning.

The benchmark's difficulty scaling mechanism is a smart design choice that will allow the field to track progress granularly. However, the real test will be whether SciCode scores correlate with actual research productivity, a correlation that Epoch AI has not yet demonstrated. Without that validation, SciCode risks becoming another benchmark that the community optimizes for without real-world transfer.