Is Claude Code really worse after version 4.6?

According to a long-time user on Hacker News, yes — it misses ~25% of instructions and fails to read code properly during review. This is anecdotal but aligns with broader quality concerns.

How does Codex compare to Claude Code?

Codex is reported as 95% reliable for precise code changes but lacks Claude's creative flair, such as ascii diagrams and brainstorming.

Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

Listen

A chart titled 'Claude Code Quality' showing a sharp drop in task success rate from 95% to 75% after version 4.6…

Products & LaunchesScore: 90

Claude Code Quality Drops Post-4.6, Users Report 25% Task Failure Rate

Claude Code quality dropped post-4.6 with ~25% instruction misses. Codex offers 95% reliability but less creativity.

AAAla SMITH & AI Research Desk·Jun 3, 2026·3 min read··234 views·AI-Generated·Report error

Source: news.ycombinator.comvia hn_claude_code, devto_claudecode, gn_mcp_protocol, medium_agentic, devto_mcp, gn_claude_apiWidely Reported

How does Claude Code quality compare to Codex after the 4.6 update?

Claude Code quality dropped significantly after version 4.6, with users reporting ~25% of instructions missed and code review failures, per a Hacker News post. OpenAI's Codex is praised for 95% reliability but less creativity.

TL;DR

Claude Code misses ~25% of instructions post-4.6. · Codex is 95% reliable but lacks flair. · Users advise always reviewing Claude Code output.

Key facts

~25% of instructions missed in Claude Code post-4.6.
Codex 5.3 reported 95% reliability by the same user.
Claude Code usage limits doubled after 80x growth.
OpenAI merged Codex into ChatGPT on June 2, 2026.

A long-time Claude Code user posted on Hacker News that the tool's performance has "immensely worse" since the 4.7 and 4.8 releases. The user, who claims to identify model switches by output style alone, reports two key failures: the agent consistently misses close to a quarter of the ask, which cascades into bloat over time, and it literally fails to read existing code or use tools properly during review.

The user tried OpenAI's Codex as an alternative, finding it "remarkably precise about the exact changes I need, reliable nearly 95% of the time." However, Codex lacks Claude's "flair" and creative presentation, such as ascii block diagrams and idea generation. The user now splits workflows: Claude for brainstorming, Codex for execution.

The reliability-creativity tradeoff

This complaint aligns with a broader pattern in coding agents: the tension between precision and exploration. Claude Code, built on Claude Opus 4.6 and later models, prioritizes conversational breadth and tool use (e.g., MCP, Playwright). Codex 5.3, per OpenAI's documentation, is optimized for deterministic code generation. The user's experience suggests that as Anthropic pushed for more features and creative outputs post-4.6, reliability suffered — a tradeoff that may be inherent to the model's architecture or the agent's tool-use loop.

Community sentiment and historical context

The HN post has only 5 points and 0 comments as of writing, indicating this is an early signal, not a consensus. However, it echoes our earlier reporting on Claude Code usage limits doubling after 80x growth, suggesting rapid adoption may have strained quality. The user's advice — "Don't trust Claude when it says it finished something" — mirrors a common developer mantra: always review AI-generated code.

Notably, the user pins the regression to the 4.7 and 4.8 releases, not the underlying model. Claude Code is a product layer atop Claude Opus 4.6 (and later models), so the issue may be in the agent's instruction-following or tool-use logic, not the LLM itself. Anthropic has not publicly acknowledged the problem.

What Codex's rise means

This anecdote arrives as OpenAI merges Codex into ChatGPT, discontinuing the standalone API. If Codex becomes the default coding agent inside ChatGPT, its reliability advantage could attract users frustrated with Claude Code's post-4.6 drift. Conversely, Anthropic may need to tighten Claude Code's instruction adherence to retain its base.

What to watch

Watch for Anthropic's response to this quality regression and whether Claude Code 4.9 or a model update addresses instruction-following. Also track Codex adoption inside ChatGPT post-merger, which could accelerate if Claude Code reliability doesn't improve.

Source: news.ycombinator.com

Source: gentic.news · Jun 3, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

The HN post captures a classic tension in AI coding tools: breadth vs. depth. Claude Code's strength — creative brainstorming and multi-tool orchestration — may be its weakness when precision is required. The user's split workflow (Claude for ideas, Codex for execution) suggests that no single agent currently excels at both. This is a structural observation, not just a model quality issue. Anthropic's rapid growth — 80x user surge, doubling of limits — may have pushed the product toward feature velocity over reliability. The post-4.6 regression could be a consequence of shipping too fast, or it could reflect a fundamental limitation of the underlying model's ability to follow instructions in a tool-use loop. The fact that the user can detect model switches by output style alone indicates that the agent's behavior is tightly coupled to the base LLM, not just the product layer. Codex's reported 95% reliability is striking, but the user is a first-time Codex user — novelty bias may inflate the figure. Still, OpenAI's decision to merge Codex into ChatGPT suggests they see reliability as a key differentiator. If code reliability becomes the primary buying signal for developer tools, Anthropic may need to recalibrate Claude Code's priorities.

#claude code #coding agents #anthropic #codex #openai

This story is part of

The AI Infrastructure War Shifts from Chips to Developer Tools

Nvidia's enterprise pivot and AWS's OpenAI bet collide with Cursor's quiet ascent

Compare side-by-side

Claude Code vs ChatGPT

→

Mentioned in this article

Claude Code Codex 5.3 OpenAI ChatGPT

Enjoyed this article?

Get the weekly AI intelligence briefing

✨AI Toolslive

Five one-click lenses on this article. Cached for 24h.

Pick a tool above to generate an instant lens on this article.

Products & Launches2 shared topics

Donate Claude Code Traces to Hugging Face's Open Dataset in One Command

From the lab

The framework underneath this story

Every article on this site sits on top of one engine and one framework — both built by the lab.

Original research · EUMAS 2026

MNEMA — A Witness Lattice for Multi-Agent AI Memory

Cryptographic memory units · 1−α detection floor · 15 pp PDF

Field framework · v1.0

Epistemic Infrastructure

12 pillars · 11-stage knowledge metabolism · pathology catalog

Claude Code Quality Drops Post-4.6, Users Report 25% Task Failure Rate

The reliability-creativity tradeoff

Community sentiment and historical context

What Codex's rise means

What to watch

AI Analysis

✨AI Toolslive

Related Articles

Why Claude Code's 80.8% SWE-Bench Score and 1M Context Window Beat Codex

OpenAI Offers Washington 5% of $852B Valuation to Ease AI Pressure

How to Use MCP Servers for Financial Data

Meta Bans Claude Code, Codex to Block Distillation as Internal Spend Hits

Anthropic Launches Claude Tag as Multiplayer Slack Agent Ahead of IPO

Donate Claude Code Traces to Hugging Face's Open Dataset in One Command

The framework underneath this story

More in Products & Launches

Microsoft to Deploy AMD Helios Rack-Scale AI at Scale on Azure

China Blank-Checks Huawei for Inference Chip Push

Alibaba Qwen3.8: 2.4T Parameter Open-Weight Model Incoming