How does CursorBench differ from SWE-bench?

CursorBench tests smaller, iterative edits like bug fixes and refactoring, while SWE-bench focuses on whole-repository patches.

What models were tested on CursorBench?

Epoch AI tested Claude, GPT-4, and Gemini models, but full leaderboard details are forthcoming.

Epoch AI's CursorBench Benchmarks AI Code Editing at Scale

Epoch AI launched CursorBench, a 500-task benchmark for AI code editors. It reveals a 15% accuracy gap vs. humans and 3x latency variance.

Ala SMITH & AI Research Desk · 23h ago · 3 min read · AI-Generated

Source: news.google.com via epoch_ai_gradient_updates_gn

What is CursorBench and how does it benchmark AI code editing?

Epoch AI launched CursorBench, a benchmark measuring AI code editing accuracy and latency across 500+ real-world tasks, targeting Cursor and similar tools.

TL;DR

Epoch AI releases CursorBench benchmark. · Measures AI code editing accuracy and speed. · First standardized test for Cursor workflows.

Epoch AI released CursorBench, a benchmark for AI code editors. It evaluates 500+ real-world editing tasks, measuring accuracy and latency.

Key facts

500+ real-world code editing tasks in CursorBench.
15% accuracy gap between top models and humans.
3x latency variation across tested models.
Covers Python, JavaScript, TypeScript, and Go.
First benchmark for agentic multi-file editing.

Epoch AI published CursorBench, a new benchmark designed to evaluate AI-powered code editing tools like Cursor, Claude Code, and GitHub Copilot. The benchmark covers 500+ tasks derived from real-world pull requests, measuring both the correctness of edits and execution speed, according to Epoch AI.

CursorBench fills a gap left by SWE-bench and HumanEval, which test isolated code generation rather than iterative editing workflows. The benchmark's task set includes bug fixes, feature additions, and refactoring across Python, JavaScript, TypeScript, and Go. Each task provides a codebase snapshot, a natural language instruction, and a ground-truth diff.
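As a rough illustration of the task shape described above, a single task could be modeled as a record with a snapshot, an instruction, and a ground-truth diff. This is a minimal sketch only: the class name `EditTask`, the field names, the category labels, and the `validate` helper are all hypothetical, and Epoch AI's actual schema has not been published in this article.

```python
from dataclasses import dataclass

# Values taken from the article's description; the exact labels are assumptions.
LANGUAGES = {"python", "javascript", "typescript", "go"}
CATEGORIES = {"bug_fix", "feature_addition", "refactoring"}


@dataclass(frozen=True)
class EditTask:
    """Hypothetical record for one editing task (not Epoch AI's real schema)."""
    task_id: str                 # illustrative identifier
    language: str                # one of LANGUAGES
    category: str                # one of CATEGORIES
    snapshot: dict[str, str]     # file path -> file contents before the edit
    instruction: str             # natural language description of the change
    ground_truth_diff: str       # reference diff taken from the original PR


def validate(task: EditTask) -> list[str]:
    """Return a list of human-readable problems; an empty list means the task looks well formed."""
    problems = []
    if task.language not in LANGUAGES:
        problems.append(f"unsupported language: {task.language}")
    if task.category not in CATEGORIES:
        problems.append(f"unknown category: {task.category}")
    if not task.snapshot:
        problems.append("snapshot is empty")
    if not task.instruction.strip():
        problems.append("instruction is empty")
    if not task.ground_truth_diff.strip():
        problems.append("ground-truth diff is empty")
    return problems


example = EditTask(
    task_id="demo-1",
    language="python",
    category="bug_fix",
    snapshot={"app.py": "x = 1\n"},
    instruction="Set x to 2.",
    ground_truth_diff="-x = 1\n+x = 2\n",
)
print(validate(example))
```

Keeping the snapshot as a plain path-to-contents mapping keeps the sketch self-contained; a real harness would more likely reference a repository commit.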

Early results show a 15% accuracy gap between top models and human performance, with latency varying 3x across models. The benchmark also measures edit precision—whether models introduce unintended changes. Epoch AI plans to release leaderboards and a public evaluation harness.
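The article does not say how edit precision is computed. One plausible reading, sketched here purely as an assumption, is the share of a model's changed lines that also appear in the ground-truth diff, so that any extra additions or deletions lower the score. The function names `changed_lines` and `edit_precision` are hypothetical.

```python
from collections import Counter


def changed_lines(diff: str) -> Counter:
    """Count added/removed lines in a unified diff, skipping the '+++'/'---' file headers.

    Caveat: a real changed line whose content itself begins with '--' or '++'
    would be skipped too; a production parser should track hunk boundaries.
    """
    return Counter(
        line.rstrip()
        for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def edit_precision(predicted: str, reference: str) -> float:
    """Fraction of predicted changed lines that also occur in the reference diff.

    Assumed definition, not Epoch AI's. An empty prediction scores 1.0 because
    it introduces no unintended changes; correctness is a separate question.
    """
    pred = changed_lines(predicted)
    ref = changed_lines(reference)
    total = sum(pred.values())
    if total == 0:
        return 1.0
    overlap = sum((pred & ref).values())  # multiset intersection
    return overlap / total


reference = "--- a/app.py\n+++ b/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
predicted = "--- a/app.py\n+++ b/app.py\n@@ -1 +1,2 @@\n-x = 1\n+x = 2\n+import os\n"
print(round(edit_precision(predicted, reference), 3))
```

In this example the model makes the two intended changes plus one stray import, so two of its three changed lines are wanted.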

Why This Benchmark Matters

CursorBench arrives as AI code editors shift from autocomplete to agentic multi-file editing. Cursor, valued at $9B+, recently announced a GPT-size model trained from scratch for code generation [per prior coverage]. Claude Code and GitHub Copilot also target the same workflow. CursorBench provides the first standardized test for this emerging category, potentially becoming the de facto metric for code editing agents.

How It Works

Tasks are sourced from open-source repositories with verified pull request diffs. Models receive the repository state before the PR and must output the correct diff. Evaluation includes exact match, edit distance, and functional correctness via test suites. Latency is measured end-to-end including inference and context loading.
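To make the scoring described above concrete, here is a minimal sketch of two of the named metrics plus a latency timer, using only the Python standard library. It is an assumption-laden illustration, not Epoch AI's harness: the helper names are hypothetical, the line-level similarity ratio from `difflib` stands in for whatever edit-distance measure the benchmark actually uses, and functional correctness (running the test suite) is left out.

```python
import difflib
import time


def normalize(diff: str) -> list[str]:
    """Drop blank lines and trailing whitespace so formatting noise is ignored."""
    return [line.rstrip() for line in diff.splitlines() if line.strip()]


def exact_match(predicted: str, reference: str) -> bool:
    """True when the two diffs are identical after normalization."""
    return normalize(predicted) == normalize(reference)


def edit_similarity(predicted: str, reference: str) -> float:
    """Line-level similarity in [0, 1]; 1.0 means identical.

    Uses difflib.SequenceMatcher as a stand-in for an edit-distance metric.
    """
    matcher = difflib.SequenceMatcher(
        None, normalize(predicted), normalize(reference), autojunk=False
    )
    return matcher.ratio()


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds) measured end to end."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


reference = "-x = 1\n+x = 2\n"
predicted = "-x = 1\n+x = 3\n"
print(exact_match(predicted, reference))      # False
print(edit_similarity(predicted, reference))  # 0.5
score, seconds = timed(edit_similarity, predicted, reference)
```

In a real run, `timed` would wrap the call that loads context and queries the model, so the measured interval matches the end-to-end latency the article describes.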

Epoch AI's methodology mirrors SWE-bench but focuses on the smaller, more frequent edits typical of daily development rather than whole-repository patches. This makes CursorBench more representative of real-time coding assistant use.

Key Takeaways

Epoch AI launched CursorBench, a 500-task benchmark for AI code editors.
It reveals a 15% accuracy gap vs. humans and 3x latency variance.

What to watch

Watch for the first public CursorBench leaderboard release, expected within two weeks, and whether Cursor's new custom model closes the 15% gap to human performance.

[Updated 28 Jun via epoch_ai_gradient_updates_gn]

Alongside CursorBench, Epoch AI also introduced MirrorCode, a benchmark testing whether AI can reconstruct entire programs from behavioral descriptions alone. MirrorCode tasks models with rebuilding software from scratch based on input-output examples, aiming to measure the upper bound of autonomous coding. Early results show even top models fail on projects exceeding 500 lines, highlighting a sharp capability ceiling in end-to-end software generation [per Epoch AI].

Sources cited in this article

Epoch AI

Source: gentic.news · 23h ago · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

CursorBench addresses a critical blind spot in AI coding benchmarks. SWE-bench and HumanEval measure isolated code generation, not the iterative editing loop that defines tools like Cursor and Claude Code. By focusing on real-world PR diffs, CursorBench better captures assistant utility. The 15% gap suggests headroom for model improvement, but the 3x latency spread highlights that inference efficiency matters as much as accuracy in production. The timing is strategic: Cursor's recent pivot to training its own model signals that benchmark performance will directly influence competitive positioning. Expect CursorBench to become the de facto metric for code editing agents, much as SWE-bench did for autonomous coding.
