Anthropic released Claude Opus 4.7 on April 16, 2026 — its most capable generally available model to date. The headline numbers are strong: 87.6% on SWE-Bench Verified, 64.3% on SWE-Bench Pro, and a 70% score on CursorBench. But the more interesting story is what Anthropic said around the edges of the launch.
Key Takeaways
- Anthropic released Claude Opus 4.7 on April 16, 2026, achieving 87.6% on SWE-Bench Verified and 64.3% on SWE-Bench Pro — leading GPT-5.4 and Gemini 3.1 Pro.
- The company also confirmed it deliberately constrained cybersecurity capabilities in Opus 4.7, with the more powerful Mythos Preview model (83.1% on CyberGym) restricted to select partners.
What's New
Opus 4.7 delivers benchmark-leading performance in software engineering and agentic reasoning. The model is available through the Anthropic API and Claude Code, with pricing unchanged at $5 per million input tokens and $25 per million output tokens.
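For orientation, here is a minimal sketch of what a request might look like through the Anthropic Python SDK. The model identifier claude-opus-4-7 is an assumption based on Anthropic's usual naming convention, not a published ID, so check the model listing before relying on it.

```python
# Minimal request sketch using the Anthropic Python SDK (pip install anthropic).
# The model ID "claude-opus-4-7" is assumed from the naming convention of prior
# releases; substitute the actual ID from Anthropic's model listing.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # assumed model ID
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Refactor the retry loop in fetch_user() to use exponential backoff.",
        }
    ],
)

print(response.content[0].text)
```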
Key Benchmarks
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-Bench Verified | 87.6% | 80.8% | — | — |
| SWE-Bench Pro | 64.3% | 53.4% | 57.7% | 54.2% |
| CursorBench | 70% | 58% | — | — |
| Terminal-Bench 2.0 | 69.4% | — | 75.1% | — |
| CyberGym | 73.1% | 73.8% | — | — |
| GPQA Diamond | 94.2% | — | 94.4% | 94.3% |

Behavioral Improvements
Opus 4.7 delivers a 14% improvement over Opus 4.6 on complex multi-step workflows while producing one-third as many tool errors. It is the first Claude model to pass what Anthropic calls "implicit-need tests": tasks where the model must infer what tools or actions are required rather than being told explicitly.
Self-verification is another behavioral shift: the model checks its own work before reporting back. For developers who've spent sessions discovering that "I've implemented the change" meant something other than what they expected, this is a significant quality-of-life improvement.
Vision Upgrade
Resolution jumps from 1.15 megapixels to 3.75 megapixels, more than three times the image detail of earlier Claude models. Screenshots, dense diagrams, design mockups, and documents with fine labels all come through at their actual fidelity now. For computer-use agents that depend on reading UI states, this is effectively a different model.
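As a concrete illustration, the sketch below sends a screenshot to the model as a base64 image content block, which is how UI-reading and computer-use workflows typically pass visual state. The content-block shape follows the existing Messages API; the model ID, and the claim that 3.75-megapixel inputs arrive without downscaling, are assumptions carried over from the announcement.

```python
# Sketch: passing a full-resolution screenshot as a base64 image content block.
# The image block is the standard Messages API shape; the model ID is assumed,
# and the "no downscaling" behavior reflects the announcement's claim rather
# than anything verified here.
import base64

import anthropic

client = anthropic.Anthropic()

with open("dashboard_screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-7",  # assumed model ID
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64,
                    },
                },
                {
                    "type": "text",
                    "text": "Read the error banner and the values of the smallest labels.",
                },
            ],
        }
    ],
)

print(response.content[0].text)
```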
New Effort Level: xhigh
The new xhigh effort level sits between high and max — designed as the sweet spot for serious coding and agentic work without the full latency cost of max. Claude Code now defaults to xhigh across all plans.
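How the effort level is selected will depend on the final API surface, which the launch materials don't spell out. The snippet below is a hypothetical sketch that assumes a request-level effort field passed through the SDK's extra_body escape hatch; the parameter name and its values are illustrative, not documented.

```python
# Hypothetical sketch only: assumes an "effort" request field with values like
# "high", "xhigh", and "max", passed via the SDK's extra_body escape hatch.
# Neither the field name nor its values are documented API; adjust to whatever
# the released API actually exposes.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # assumed model ID
    max_tokens=2048,
    extra_body={"effort": "xhigh"},  # assumed parameter name and value
    messages=[
        {
            "role": "user",
            "content": "Plan the database migration, then apply it step by step.",
        }
    ],
)

print(response.content[0].text)
```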
The Deliberately Constrained Model
Here's the part that deserves more attention: Anthropic is explicit that Claude Mythos Preview remains its most powerful model. Opus 4.7 is the first model to receive new cybersecurity safeguards designed as a prerequisite for eventually releasing Mythos-class capability more broadly. During training, Anthropic experimented with differentially reducing the model's cybersecurity capabilities, deliberately pulling them back relative to what they would otherwise have been.
On CyberGym, Opus 4.7 scores 73.1% — effectively flat against Opus 4.6's revised 73.8%. This flat line is a policy choice, not a capability ceiling. Mythos Preview scores 83.1% on the same benchmark but remains restricted to vetted Glasswing partners.
That's a deliberate gap between "what this model could do" and "what Anthropic is letting it do." Not because the capability isn't there, but because the deployment infrastructure for that capability isn't ready yet.
Opus 4.7 is the test bed for that infrastructure: automatic detection and blocking of prohibited cybersecurity uses, the Cyber Verification Program for legitimate security professionals, and the high effort level as a gate for more powerful reasoning. These are the safety systems that must prove themselves on Opus 4.7 before Anthropic will consider deploying them on a Mythos-class model.
One Caveat
Terminal-Bench 2.0 is a weak spot: GPT-5.4 scores 75.1% there versus Opus 4.7's 69.4%. BrowseComp also softens compared to Opus 4.6. If those specific workflows matter to your use case, it's worth knowing.
What This Means in Practice
For developers building on Claude Code, the upgrade case is straightforward: better coding benchmarks, same price, and behavioral improvements that reduce drift and tool errors in long sessions. For enterprise teams running agentic coding workflows, the 3x improvement in production task resolution is the number that matters most — it's measured on real tasks, not synthetic benchmarks.
The instruction-following upgrade comes with a migration note: Opus 4.7 takes prompts precisely and literally, where Opus 4.6 interpreted instructions loosely and sometimes skipped steps. Anthropic recommends retuning prompts and harnesses when moving from 4.6 to 4.7.
gentic.news Analysis
This launch is notable not just for the benchmark numbers, but for what it reveals about Anthropic's deployment strategy. The company is building a model deployment ladder — each rung is tested for safety properties before the next rung gets released. Opus 4.7 is the rung below Mythos, inheriting some capabilities while deliberately constraining others.
This follows a pattern we've seen across the industry: as frontier models become more capable, companies are becoming more explicit about capability gating. OpenAI has discussed similar approaches with GPT-5.4, and Google DeepMind has published on safety architectures for Gemini 3.1. The difference here is that Anthropic is being unusually transparent about the constraint — most companies would just ship the model without explaining the safety architecture behind it.
The timing is also interesting. Anthropic is running at a $30 billion annualized revenue rate and is in early IPO talks, so the caution around Mythos is not for lack of commercial incentive to release it. The constraint is genuine, and the Opus 4.7 launch is the company's acknowledgment that deploying dangerous capability safely requires infrastructure you build and test before you need it, not after.
For practitioners, the key takeaway is that the frontier race has shifted. GPQA Diamond scores are within 0.2% across three frontier models — pure reasoning is saturated. The differentiation is now in applied performance on complex, multi-step tasks: coding, tool use, production task resolution, long-running agentic workflows. That's a more useful competition to win, and harder to fake in benchmarks.
Frequently Asked Questions
How does Claude Opus 4.7 compare to GPT-5.4?
Opus 4.7 leads on SWE-Bench Pro (64.3% vs 57.7%) and CursorBench (70%, with no published GPT-5.4 score), while GPT-5.4 leads on Terminal-Bench 2.0 (75.1% vs 69.4%). GPQA Diamond scores are nearly identical (94.2% vs 94.4%). Opus 4.7 is the better choice for coding and agentic tasks; GPT-5.4 may be better for terminal-based workflows.
What is Claude Mythos Preview and when will it be released?
Claude Mythos Preview is Anthropic's most powerful model, scoring 83.1% on CyberGym compared to Opus 4.7's 73.1%. It is currently restricted to vetted Glasswing partners. Anthropic has not announced a general release date, stating that Opus 4.7 must first prove the safety infrastructure needed for Mythos-class capability.
What is the xhigh effort level in Opus 4.7?
xhigh is a new effort level between high and max, designed as the sweet spot for serious coding and agentic work without the full latency cost of max. Claude Code now defaults to xhigh across all plans, signaling Anthropic's recommendation for production use.
Does Opus 4.7 have any weaknesses?
Yes. Terminal-Bench 2.0 is a weak spot (69.4% vs GPT-5.4's 75.1%), and BrowseComp scores softened compared to Opus 4.6. The instruction-following upgrade also means prompts may need retuning: Opus 4.7 takes instructions literally, where Opus 4.6 was more flexible.