DeepSeek-R1 Matches Claude 3.5 Sonnet on SWE-Bench Verified Coding Benchmark
What Happened
On May 27, 2025, AI researcher Rohan Paul shared a tweet with the provocative statement "The era of human coding is over," linking to a report about DeepSeek's new R1 model. The source material reports that DeepSeek-R1 achieved 79.8% on the SWE-Bench Verified benchmark, matching the performance of Anthropic's Claude 3.5 Sonnet.
SWE-Bench Verified is a challenging benchmark that tests AI models' ability to solve real-world software engineering problems drawn from actual GitHub issues. The "Verified" suffix denotes a human-validated subset of issues confirmed to be well specified and solvable; candidate fixes are then evaluated automatically against the original repository's test suite, making it a rigorous measure of practical coding ability.
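For readers who want to see what these problems look like, here is a minimal sketch of inspecting the benchmark. It assumes the dataset is distributed on the Hugging Face Hub under the `princeton-nlp/SWE-bench_Verified` identifier and that field names match the public release; check the dataset card if the schema has changed.

```python
# Minimal sketch: inspect a SWE-Bench Verified instance.
# Assumes the Hugging Face identifier "princeton-nlp/SWE-bench_Verified";
# field names may differ across releases.
from datasets import load_dataset

# Each instance pairs a real GitHub issue with the repository state at the
# time the issue was filed, plus the tests a correct fix must satisfy.
swe_bench = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = swe_bench[0]
print(example["repo"])               # source repository, e.g. "astropy/astropy"
print(example["instance_id"])        # unique id derived from the repo and PR number
print(example["problem_statement"])  # the GitHub issue text shown to the model
print(example["FAIL_TO_PASS"])       # tests that must go from failing to passing
print(example["PASS_TO_PASS"])       # existing tests that must keep passing
```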
Context
DeepSeek is a Chinese AI company that has been releasing increasingly capable open-source models. Its earlier coding-focused model, DeepSeek-Coder-V2, was already competitive on coding benchmarks. DeepSeek-R1 is the company's latest reasoning-focused model, designed specifically to tackle complex problem-solving tasks such as software engineering.
Claude 3.5 Sonnet, released by Anthropic in June 2024, has been widely regarded as one of the top-performing models for coding tasks, particularly on SWE-Bench, where it set a high bar for performance.
The Benchmark Result
The 79.8% score on SWE-Bench Verified puts DeepSeek-R1 in elite company. While the source doesn't provide a full comparison table, it specifically notes the match with Claude 3.5 Sonnet, suggesting this represents state-of-the-art or near state-of-the-art performance.
What makes this result notable:
- Real-world relevance: SWE-Bench problems come from actual GitHub repositories, not synthetic coding challenges
- Verification rigor: Solutions must pass the original project's test suite (see the sketch after this list)
- Practical difficulty: Problems often require understanding complex codebases and making minimal, correct changes
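The verification step can be pictured with a short sketch. This is not the official SWE-Bench harness, which runs each instance inside a per-repository containerized environment; the repository path, the use of pytest, and the helper name `verify_patch` are illustrative assumptions.

```python
# Simplified sketch of SWE-Bench-style verification (not the official harness).
import subprocess


def verify_patch(repo_dir: str, model_patch: str,
                 fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply a model-generated patch and check both sets of required tests."""
    # Apply the candidate fix to a clean checkout at the instance's base commit.
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False  # the patch does not even apply cleanly

    # The fix must make the failing tests pass without breaking existing ones.
    for test_id in fail_to_pass + pass_to_pass:
        result = subprocess.run(["python", "-m", "pytest", test_id, "-q"],
                                cwd=repo_dir)
        if result.returncode != 0:
            return False
    return True
```

Because every passing solution must survive both lists of tests, a model cannot score well by producing plausible-looking but incorrect patches, which is what makes the benchmark a meaningful proxy for practical coding ability.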
What We Don't Know
The source material is limited to this single benchmark result. We don't have:
- Model size or architecture details
- Training methodology or datasets
- Performance on other coding benchmarks (HumanEval, MBPP, etc.)
- Availability (open-source vs. API access)
- Pricing or deployment details
Given the thin source material, this article focuses only on what's verifiable from the provided information.