gentic.news — AI News Intelligence Platform


Anthropic Opus 4.7: 87.6% SWE-Bench, Constrained Cyber Capabilities

Source: pub.towardsai.net, via towards_ai (Corroborated)

Anthropic released Claude Opus 4.7 on April 16, 2026 — its most capable generally available model to date. The headline numbers are strong: 87.6% on SWE-Bench Verified, 64.3% on SWE-Bench Pro, and a 70% score on CursorBench. But the more interesting story is what Anthropic said around the edges of the launch.

Key Takeaways

  • Anthropic released Claude Opus 4.7 on April 16, 2026, achieving 87.6% on SWE-Bench Verified and 64.3% on SWE-Bench Pro — leading GPT-5.4 and Gemini 3.1 Pro.
  • The company also confirmed it deliberately constrained cybersecurity capabilities in Opus 4.7, with the more powerful Mythos Preview model (83.1% on CyberGym) restricted to select partners.

What's New

Opus 4.7 delivers benchmark-leading performance in software engineering and agentic reasoning. The model is available through the Anthropic API and Claude Code, with pricing unchanged at $5 per million input tokens and $25 per million output tokens.
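At those rates, per-request cost is simple arithmetic. A minimal sketch — the helper name is ours, and only the $5/$25 per-million rates come from the launch pricing above:

```python
def opus47_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate an Opus 4.7 request's cost at $5/M input and $25/M output tokens."""
    return (input_tokens * 5 + output_tokens * 25) / 1_000_000

# A typical agentic turn with 20k input tokens and 4k output tokens:
print(opus47_cost_usd(20_000, 4_000))  # 0.2 → about $0.20
```

At these prices, output tokens dominate quickly: a response a fifth the length of its prompt already costs as much as the entire input context.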

Key Benchmarks

Benchmark            Opus 4.7   Opus 4.6   GPT-5.4   Gemini 3.1 Pro
SWE-Bench Verified   87.6%      80.8%      —         —
SWE-Bench Pro        64.3%      53.4%      57.7%     54.2%
CursorBench          70%        58%        —         —
Terminal-Bench 2.0   69.4%      —          75.1%     —
CyberGym             73.1%      73.8%      —         —
GPQA Diamond         94.2%      —          94.4%     94.3%

Behavioral Improvements

Opus 4.7 delivers a 14% improvement over Opus 4.6 on complex multi-step workflows while producing a third as many tool errors. It is the first Claude model to pass what Anthropic calls "implicit-need tests" — tasks where the model must infer which tools or actions are required rather than being told explicitly.

Self-verification is another behavioral shift: the model checks its own work before reporting back. For developers who've spent sessions discovering that "I've implemented the change" meant something other than what they expected, this is a significant quality-of-life improvement.

Vision Upgrade

Resolution jumps from 1.15 megapixels to 3.75 megapixels — more than three times the image detail of earlier Claude models. Screenshots, dense diagrams, design mockups, and documents with fine labels now come through at their actual fidelity. For computer-use agents that depend on reading UI state, this is effectively a different model.

New Effort Level: xhigh

The new xhigh effort level sits between high and max — designed as the sweet spot for serious coding and agentic work without the full latency cost of max. Claude Code now defaults to xhigh across all plans.
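The effort ladder can be sketched as a simple routing policy. This is a hypothetical illustration: the level ordering follows the article, but the level names as API values and the routing function are our assumptions, not documented behavior.

```python
# Hypothetical effort ladder, ordered cheapest to most expensive.
# Per the launch notes, "xhigh" sits between "high" and "max".
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

def pick_effort(task_kind: str) -> str:
    """Route serious coding/agentic work to xhigh; reserve max for rare cases."""
    if task_kind in {"coding", "agentic", "refactor"}:
        return "xhigh"          # Claude Code's new default tier
    if task_kind == "deep-research":
        return "max"            # pay the full latency cost only when it matters
    return "medium"             # quick Q&A, summaries

print(pick_effort("coding"))  # xhigh
```

The design point is the trade-off the article describes: xhigh buys most of max's reasoning quality at a fraction of its latency, which is why it makes sense as the default for long agentic sessions.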

The Deliberately Constrained Model

Here's the part that deserves more attention: Anthropic is explicit that Claude Mythos Preview remains its most powerful model. Opus 4.7 is the first model to receive new cybersecurity safeguards designed as a prerequisite for eventually releasing Mythos-class capability more broadly. During its training, Anthropic experimented with efforts to differentially reduce its cybersecurity capabilities — deliberately pulling them back relative to what the underlying capability would otherwise have been.

On CyberGym, Opus 4.7 scores 73.1% — effectively flat against Opus 4.6's revised 73.8%. This flat line is a policy choice, not a capability ceiling. Mythos Preview scores 83.1% on the same benchmark but remains restricted to vetted Glasswing partners.

That's a deliberate gap between "what this model could do" and "what Anthropic is letting it do." Not because the capability isn't there, but because the deployment infrastructure for that capability isn't ready yet.

Opus 4.7 is the test bed for that infrastructure: automatic detection and blocking of prohibited cybersecurity uses, the Cyber Verification Program for legitimate security professionals, and the high effort level as a gate for more powerful reasoning. These are the safety systems that must prove themselves on Opus 4.7 before Anthropic will consider deploying them on a Mythos-class model.

One Caveat

Terminal-Bench 2.0 is a regression — GPT-5.4 scores 75.1% there versus Opus 4.7's 69.4%. BrowseComp also softens compared to Opus 4.6. If those specific workflows matter to your use case, it's worth knowing.

What This Means in Practice

For developers building on Claude Code, the upgrade case is straightforward: better coding benchmarks, same price, and behavioral improvements that reduce drift and tool errors in long sessions. For enterprise teams running agentic coding workflows, the 3x improvement in production task resolution is the number that matters most — it's measured on real tasks, not synthetic benchmarks.

The instruction-following upgrade comes with a migration note: Opus 4.7 takes prompts precisely and literally, where Opus 4.6 interpreted instructions loosely and sometimes skipped steps. Anthropic recommends retuning prompts and harnesses when moving from 4.6 to 4.7.
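As a hedged illustration of what that retuning can look like — both prompts below are invented examples, not Anthropic guidance — a 4.6-era prompt could lean on the model to fill in steps, while a 4.7 prompt should enumerate them:

```python
# Invented example prompts illustrating the literal-instruction shift.
loose_prompt = "Fix the failing tests and clean up."  # 4.6 would infer the steps

explicit_prompt = "\n".join([
    "1. Run the full test suite and list every failure.",
    "2. Fix the root cause of each failure; do not skip or delete tests.",
    "3. Re-run the suite and report the final pass/fail counts.",
    "4. Only then remove dead code introduced by the fixes.",
])
```

The same logic applies to agent harnesses: guardrails that relied on 4.6 ignoring over-broad instructions may need tightening, since 4.7 will follow them to the letter.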

gentic.news Analysis

This launch is notable not just for the benchmark numbers, but for what it reveals about Anthropic's deployment strategy. The company is building a model deployment ladder — each rung is tested for safety properties before the next rung gets released. Opus 4.7 is the rung below Mythos, inheriting some capabilities while deliberately constraining others.

This follows a pattern we've seen across the industry: as frontier models become more capable, companies are becoming more explicit about capability gating. OpenAI has discussed similar approaches with GPT-5.4, and Google DeepMind has published on safety architectures for Gemini 3.1. The difference here is that Anthropic is being unusually transparent about the constraint — most companies would just ship the model without explaining the safety architecture behind it.

The timing is also interesting. Anthropic is running at a $30 billion annualized revenue rate and is in early IPO talks, so its caution about Mythos cannot be explained by a lack of commercial incentive to ship it. The constraint is genuine — and the Opus 4.7 launch is the company's acknowledgment that deploying dangerous capability safely requires infrastructure you build and test before you need it, not after.

For practitioners, the key takeaway is that the frontier race has shifted. GPQA Diamond scores are within 0.2% across three frontier models — pure reasoning is saturated. The differentiation is now in applied performance on complex, multi-step tasks: coding, tool use, production task resolution, long-running agentic workflows. That's a more useful competition to win, and harder to fake in benchmarks.

Frequently Asked Questions

How does Claude Opus 4.7 compare to GPT-5.4?

Opus 4.7 leads on SWE-Bench Pro (64.3% vs 57.7%) and CursorBench (70%, with no reported GPT-5.4 score), while GPT-5.4 leads on Terminal-Bench 2.0 (75.1% vs 69.4%). GPQA Diamond scores are nearly identical (94.2% vs 94.4%). Opus 4.7 is the better choice for coding and agentic tasks; GPT-5.4 may be better for terminal-based workflows.

What is Claude Mythos Preview and when will it be released?

Claude Mythos Preview is Anthropic's most powerful model, scoring 83.1% on CyberGym compared to Opus 4.7's 73.1%. It is currently restricted to vetted Glasswing partners. Anthropic has not announced a general release date, stating that Opus 4.7 must first prove the safety infrastructure needed for Mythos-class capability.

What is the xhigh effort level in Opus 4.7?

xhigh is a new effort level between high and max, designed as the sweet spot for serious coding and agentic work without the full latency cost of max. Claude Code now defaults to xhigh across all plans, signaling Anthropic's recommendation for production use.

Does Opus 4.7 have any weaknesses?

Yes. Terminal-Bench 2.0 is a regression (69.4% vs GPT-5.4's 75.1%), and BrowseComp scores softened compared to Opus 4.6. The instruction-following upgrade also means prompts may need retuning — Opus 4.7 takes instructions literally, where Opus 4.6 was more flexible.


AI Analysis

The deliberate capability constraint on cybersecurity is the most significant signal in this launch. Anthropic is effectively admitting that Opus 4.7's CyberGym score of 73.1% is not a capability ceiling but a policy floor — the model could have scored higher, but the company chose to differentially reduce cybersecurity capabilities during training. This goes beyond RLHF or refusal training, which suppress behavior while leaving the underlying capability in the weights: a refusal-trained model still knows how, it just declines. If this becomes standard practice, it would represent a fundamental shift in how frontier models are built — capability gating moves from the API layer into the model weights themselves.

The saturated GPQA Diamond scores (within 0.2% across three frontier models) confirm that the frontier race has entered a new phase. Pure reasoning benchmarks are no longer differentiating; the competition is now about applied performance on complex, multi-step tasks. This makes sense from a scaling perspective: as models approach human-level performance on reasoning benchmarks, further gains require architectural innovations or training-data improvements rather than simple parameter scaling. The 12-point jump on CursorBench, which measures autonomous coding in an actual IDE, is more meaningful than a 2-point improvement on a synthetic benchmark.

Anthropic's transparency about the constraint architecture is unusual and strategically interesting. The company is running at $30 billion annualized revenue and in early IPO talks, so it has strong commercial incentives to ship Mythos. By being explicit about the safety ladder, Anthropic is likely signaling to regulators and enterprise buyers that it takes deployment safety seriously, potentially positioning itself favorably in future regulatory frameworks. It is also a competitive signal: by acknowledging that Mythos exists and is more capable, Anthropic is telling the market that it has more capability in reserve than it is currently shipping.