How does SciCode differ from SWE-Bench?

SciCode focuses on scientific research coding tasks requiring domain knowledge, while SWE-Bench tests software engineering skills like bug fixing and feature implementation.

Which models were evaluated on SciCode?

Epoch AI evaluated GPT-5.5, Gemini 3.5 Pro, and Claude 4, with all scoring below 30%.


SciCode: Epoch AI Launches Benchmark Measuring AI Research Ability

Epoch AI launched SciCode benchmark testing LLMs on real research coding tasks. Top models score below 30%, exposing gap between coding benchmarks and scientific ability.

Ala SMITH & AI Research Desk · 22h ago · 4 min read · AI-Generated

Source: news.google.com via epoch_ai_gradient_updates_gn · Widely Reported

What is the SciCode benchmark from Epoch AI?

Epoch AI launched SciCode, a benchmark evaluating LLMs on real scientific research coding tasks, requiring multi-step reasoning and domain knowledge. Early results show top models scoring below 30%.

TL;DR

Epoch AI released SciCode benchmark. · Tests LLMs on real research code tasks. · Challenges models beyond standard coding.

Epoch AI launched SciCode, a benchmark for evaluating LLMs on real scientific research coding tasks. Early results show top models scoring below 30%, highlighting the gap between coding benchmarks and genuine research ability.

Key facts

Top LLMs score below 30% on SciCode benchmark.
SciCode includes problems from physics, chemistry, biology.
Epoch AI designed difficulty scaling by reasoning steps.
Benchmark aims to measure genuine scientific discovery ability.

Epoch AI has released SciCode, a new benchmark designed to test whether large language models can perform real scientific research coding. Unlike existing benchmarks that focus on algorithmic puzzles or software engineering tasks, SciCode requires models to solve problems drawn from actual research papers across physics, chemistry, and biology. The benchmark includes tasks such as implementing simulation code, analyzing experimental data, and reproducing key figures from published studies.

Early results reveal a significant gap between current LLM capabilities and the requirements of scientific research. Even the best-performing models, including GPT-5.5 and Gemini 3.5 Pro, scored below 30% on SciCode, according to Epoch AI's evaluation. By comparison, scores on standard coding benchmarks like SWE-Bench and HumanEval exceed 80%, suggesting that existing evaluations overstate models' ability to contribute to scientific work.

Why SciCode matters

The benchmark addresses a growing tension in AI research: while LLMs are increasingly promoted as tools for scientific discovery, their evaluation has remained narrow. SciCode's design forces models to combine coding proficiency with domain knowledge and multi-step reasoning, mirroring the workflow of a research scientist. For example, one task requires implementing a Monte Carlo simulation from a condensed matter physics paper and reproducing a phase transition plot — a challenge that demands both physics understanding and coding skill.
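To give a sense of what a task of this kind involves, here is a minimal sketch of a Metropolis Monte Carlo simulation of the 2D Ising model, a textbook system with a phase transition. It is an illustration only, not a SciCode task: the article does not name the paper or model involved, and the function name, lattice size, and temperatures below are assumptions chosen for the example.

```python
import math
import random


def metropolis_ising(size, temperature, sweeps, seed=0):
    """Mean |magnetization| per spin of a 2D Ising lattice (J = 1, k_B = 1).

    Illustrative sketch only; names and parameters are assumptions.
    """
    rng = random.Random(seed)
    # Start fully ordered so that low temperatures equilibrate quickly.
    spins = [[1] * size for _ in range(size)]
    n_sites = size * size
    burn_in = sweeps // 2  # discard the first half of the run
    samples = []

    for sweep in range(sweeps):
        for _ in range(n_sites):
            i = rng.randrange(size)
            j = rng.randrange(size)
            # Sum of the four nearest neighbours with periodic boundaries.
            neighbours = (
                spins[(i + 1) % size][j] + spins[(i - 1) % size][j]
                + spins[i][(j + 1) % size] + spins[i][(j - 1) % size]
            )
            # Energy change if spin (i, j) is flipped.
            delta_e = 2 * spins[i][j] * neighbours
            # Metropolis rule: accept downhill moves, else accept with exp(-dE/T).
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
                spins[i][j] = -spins[i][j]
        if sweep >= burn_in:
            total = sum(sum(row) for row in spins)
            samples.append(abs(total) / n_sites)

    return sum(samples) / len(samples)


if __name__ == "__main__":
    # Scan temperatures around the known critical point (about 2.27 for J = 1).
    for temperature in (1.5, 2.0, 2.27, 2.5, 3.5):
        m = metropolis_ising(size=12, temperature=temperature, sweeps=200)
        print(f"T = {temperature:.2f}  <|m|> = {m:.3f}")
```

Plotting the magnetization against temperature gives the kind of phase transition curve the article describes. Even in this toy version, getting the energy change and the acceptance rule right requires knowing the physics, not just the syntax.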

Epoch AI's approach also includes a novel difficulty scaling mechanism. Problems are categorized by the number of reasoning steps required, the level of domain knowledge needed, and the length of the code solution. This allows researchers to track progress across specific dimensions of scientific capability.
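One way to picture how such categories could be used is to tag each task result with its difficulty attributes and report a pass rate per bucket. The sketch below is an assumption about how that bookkeeping might look, not Epoch AI's implementation: the `TaskResult` fields, the bucket boundaries, and the demo records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One model attempt at one task (all field names are hypothetical)."""
    task_id: str
    reasoning_steps: int  # number of reasoning steps the task requires
    domain_level: str     # e.g. "undergraduate" or "research" (made-up labels)
    solution_lines: int   # length of the reference solution
    passed: bool


def pass_rate_by(results, key):
    """Group results by key(result) and return {bucket: fraction passed}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        bucket = key(result)
        totals[bucket] += 1
        passes[bucket] += int(result.passed)
    return {bucket: passes[bucket] / totals[bucket] for bucket in totals}


def steps_bucket(result):
    """Illustrative bucketing by reasoning steps; the cut-offs are arbitrary."""
    if result.reasoning_steps <= 2:
        return "1-2 steps"
    if result.reasoning_steps <= 5:
        return "3-5 steps"
    return "6+ steps"


if __name__ == "__main__":
    # Made-up records for demonstration; these are not SciCode results.
    demo = [
        TaskResult("t1", 1, "undergraduate", 15, True),
        TaskResult("t2", 4, "graduate", 60, False),
        TaskResult("t3", 8, "research", 140, False),
    ]
    print(pass_rate_by(demo, steps_bucket))
    print(pass_rate_by(demo, lambda r: r.domain_level))
```

Slicing the same results by different keys is what lets progress be tracked along one dimension at a time, for example separating failures caused by long reasoning chains from failures caused by missing domain knowledge.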

Implications for AI development

The low scores on SciCode have practical implications for AI adoption in research settings. Companies like Google and OpenAI have positioned their models as scientific assistants, but SciCode suggests that current systems remain unreliable for tasks requiring deep domain integration. According to Epoch AI, the benchmark is designed to evolve as models improve, with new problems added from recent publications to prevent saturation.

The benchmark also highlights a structural weakness in current LLM training: models are trained on vast amounts of code from repositories like GitHub, but scientific code is often idiosyncratic, poorly documented, and requires understanding of the underlying theory. SciCode's results suggest that scaling alone may not close this gap, and that targeted training on scientific workflows may be necessary.

Key Takeaways

Epoch AI launched SciCode benchmark testing LLMs on real research coding tasks.
Top models score below 30%, exposing gap between coding benchmarks and scientific ability.

What to watch


Watch for the first model to surpass 50% on SciCode, which would indicate a meaningful advance in AI's research capability. Epoch AI plans to update the benchmark with new problems quarterly from recent preprints.


[Updated 28 Jun via epoch_ai_gradient_updates_gn]

Epoch AI also unveiled MirrorCode, a benchmark that tests whether AI can rebuild entire software projects solely from observing program behavior, without access to source code. Early results show models successfully reconstructing programs of up to 10,000 lines but struggling beyond that threshold [per Epoch AI]. MirrorCode complements SciCode by measuring AI's ability to reverse-engineer and replicate existing codebases, a skill critical for understanding legacy scientific software.
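The article does not say how MirrorCode scores a reconstruction. As a rough sketch of the general idea of behavior-only comparison, the example below records input/output pairs from a black-box reference and measures how often a candidate reimplementation reproduces them. Both function names are hypothetical, and a real harness would also need to handle files, state, and timing.

```python
def record_observations(program, inputs):
    """Run a black-box program on each input tuple and record (args, output)."""
    return [(args, program(*args)) for args in inputs]


def behaviour_match_rate(observations, candidate):
    """Fraction of recorded observations that the candidate reproduces exactly."""
    if not observations:
        raise ValueError("need at least one observation")
    matches = 0
    for args, expected in observations:
        try:
            actual = candidate(*args)
        except Exception:
            # A crash on an input the reference handled counts as a mismatch.
            continue
        if actual == expected:
            matches += 1
    return matches / len(observations)


if __name__ == "__main__":
    # Stand-in "reference" whose source the candidate author never sees.
    def reference(text):
        return " ".join(text.split()).lower()

    # A reconstruction that collapses whitespace but forgets to lowercase.
    def candidate(text):
        return " ".join(text.split())

    observed = record_observations(
        reference, [("Hello  World",), ("already clean",), ("  MIXED case ",)]
    )
    print(behaviour_match_rate(observed, candidate))
```

A match rate on sampled inputs only shows agreement on the behavior that was observed, which is one reason reconstructing larger programs from behavior alone is hard.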


Source: gentic.news · 22h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

SciCode represents a necessary correction to the narrative that LLMs are ready for scientific discovery. The benchmark's design, which requires domain knowledge, multi-step reasoning, and code implementation from papers, directly targets the gap between what models can do on curated coding problems and what they can contribute to actual research. The sub-30% scores are striking precisely because they come from the same models that dominate SWE-Bench and HumanEval.

Epoch AI's approach also exposes a structural issue in LLM training data. Scientific code is underrepresented in training corpora compared to general-purpose code, and the reasoning patterns required (understanding experimental design, interpreting results, debugging domain-specific errors) are not well captured by existing RLHF or instruction-tuning pipelines. This suggests that improving SciCode performance may require not just larger models or more data, but fundamentally different training objectives that incorporate scientific reasoning.

The benchmark's difficulty scaling mechanism is a smart design choice that will allow the field to track progress granularly. However, the real test will be whether SciCode scores correlate with actual research productivity, a correlation that Epoch AI has not yet demonstrated. Without that validation, SciCode risks becoming another benchmark that the community optimizes for without real-world transfer.

#research #ai #benchmarks
