Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

A chart titled 'Claude Code Quality' showing a sharp drop in task success rate from 95% to 75% after version 4.6…

Claude Code Quality Drops Post-4.6, Users Report 25% Task Failure Rate

Claude Code quality dropped post-4.6 with ~25% instruction misses. Codex offers 95% reliability but less creativity.

·1d ago·3 min read··18 views·AI-Generated·Report error
Share:
Source: news.ycombinator.comvia hn_claude_code, devto_claudecode, gn_mcp_protocol, medium_agentic, devto_mcp, gn_claude_apiWidely Reported
How does Claude Code quality compare to Codex after the 4.6 update?

Claude Code quality dropped significantly after version 4.6, with users reporting ~25% of instructions missed and code review failures, per a Hacker News post. OpenAI's Codex is praised for 95% reliability but less creativity.

TL;DR

Claude Code misses ~25% of instructions post-4.6. · Codex is 95% reliable but lacks flair. · Users advise always reviewing Claude Code output.

Claude Code quality dropped significantly after version 4.6, with users reporting ~25% of instructions missed and code review failures, per a Hacker News post. OpenAI's Codex is praised for 95% reliability but less creativity.

Key facts

  • ~25% of instructions missed in Claude Code post-4.6.
  • Codex 5.3 reported 95% reliability by the same user.
  • Claude Code usage limits doubled after 80x growth.
  • OpenAI merged Codex into ChatGPT on June 2, 2026.

A long-time Claude Code user posted on Hacker News that the tool's performance has "immensely worse" since the 4.7 and 4.8 releases. The user, who claims to identify model switches by output style alone, reports two key failures: the agent consistently misses close to a quarter of the ask, which cascades into bloat over time, and it literally fails to read existing code or use tools properly during review.

The user tried OpenAI's Codex as an alternative, finding it "remarkably precise about the exact changes I need, reliable nearly 95% of the time." However, Codex lacks Claude's "flair" and creative presentation, such as ascii block diagrams and idea generation. The user now splits workflows: Claude for brainstorming, Codex for execution.

The reliability-creativity tradeoff

This complaint aligns with a broader pattern in coding agents: the tension between precision and exploration. Claude Code, built on Claude Opus 4.6 and later models, prioritizes conversational breadth and tool use (e.g., MCP, Playwright). Codex 5.3, per OpenAI's documentation, is optimized for deterministic code generation. The user's experience suggests that as Anthropic pushed for more features and creative outputs post-4.6, reliability suffered — a tradeoff that may be inherent to the model's architecture or the agent's tool-use loop.

Community sentiment and historical context

The HN post has only 5 points and 0 comments as of writing, indicating this is an early signal, not a consensus. However, it echoes our earlier reporting on Claude Code usage limits doubling after 80x growth, suggesting rapid adoption may have strained quality. The user's advice — "Don't trust Claude when it says it finished something" — mirrors a common developer mantra: always review AI-generated code.

Notably, the user pins the regression to the 4.7 and 4.8 releases, not the underlying model. Claude Code is a product layer atop Claude Opus 4.6 (and later models), so the issue may be in the agent's instruction-following or tool-use logic, not the LLM itself. Anthropic has not publicly acknowledged the problem.

What Codex's rise means

This anecdote arrives as OpenAI merges Codex into ChatGPT, discontinuing the standalone API. If Codex becomes the default coding agent inside ChatGPT, its reliability advantage could attract users frustrated with Claude Code's post-4.6 drift. Conversely, Anthropic may need to tighten Claude Code's instruction adherence to retain its base.

What to watch

Watch for Anthropic's response to this quality regression and whether Claude Code 4.9 or a model update addresses instruction-following. Also track Codex adoption inside ChatGPT post-merger, which could accelerate if Claude Code reliability doesn't improve.


Source: news.ycombinator.com


Source: gentic.news · · author= · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

The HN post captures a classic tension in AI coding tools: breadth vs. depth. Claude Code's strength — creative brainstorming and multi-tool orchestration — may be its weakness when precision is required. The user's split workflow (Claude for ideas, Codex for execution) suggests that no single agent currently excels at both. This is a structural observation, not just a model quality issue. Anthropic's rapid growth — 80x user surge, doubling of limits — may have pushed the product toward feature velocity over reliability. The post-4.6 regression could be a consequence of shipping too fast, or it could reflect a fundamental limitation of the underlying model's ability to follow instructions in a tool-use loop. The fact that the user can detect model switches by output style alone indicates that the agent's behavior is tightly coupled to the base LLM, not just the product layer. Codex's reported 95% reliability is striking, but the user is a first-time Codex user — novelty bias may inflate the figure. Still, OpenAI's decision to merge Codex into ChatGPT suggests they see reliability as a key differentiator. If code reliability becomes the primary buying signal for developer tools, Anthropic may need to recalibrate Claude Code's priorities.
Compare side-by-side
Claude Code vs ChatGPT

Mentioned in this article

Enjoyed this article?
Share:

AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

Related Articles

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

More in Products & Launches

View all