A 14-year principal engineer with a substantial codebase—80,000 lines of Python/TypeScript and 2,800 tests—has shared a detailed, hands-on comparison after 100 hours using Anthropic's Claude Code and 20 hours with OpenAI's Codex. The core finding is stark: AI coding assistants don't replace engineering judgment; they amplify it, for better or worse. The engineer's experience suggests that without rigorous guardrails, even state-of-the-art models can degrade code quality by prioritizing speed over sound architecture.
Key Takeaways
- A senior engineer with 100 hours in Claude Code and 20 in Codex reports Claude often rushes to patch, while Codex is more deliberate.
- The real product is the guardrail system—docs and review loops—not the AI itself.
The Workflow: Guardrails Are the Real Product

The engineer's successful workflow centers on containment and review, not raw AI output.
1. Plan Mode First: Before any code generation, a "plan mode" engages up to eight subagents. Each is tasked with reviewing a specific domain: architecture, coding standards, performance, UI design, and more. Critically, these agents are grounded in a library of reference documents the engineer built over time (e.g., postgres_performance.md, python_threading.md). This creates a knowledge-bound system, preventing the AI from hallucinating or applying generic, inappropriate patterns.
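The grounding step described above can be sketched in a few lines. This is a hypothetical illustration, not the engineer's actual tooling: the domain-to-document mapping, file names beyond the two mentioned in the report, and the prompt wording are all assumptions.

```python
from pathlib import Path

# Hypothetical mapping of review domains to reference docs; the two
# .md names from the report are real examples, the rest is invented.
DOMAIN_DOCS = {
    "architecture": ["architecture_overview.md"],
    "performance": ["postgres_performance.md"],
    "concurrency": ["python_threading.md"],
}

def build_subagent_prompt(domain: str, task: str, doc_root: Path) -> str:
    """Assemble a domain-scoped review prompt grounded only in that
    domain's reference docs, so the subagent cannot drift into
    generic, inappropriate patterns."""
    sections = []
    for name in DOMAIN_DOCS.get(domain, []):
        doc = doc_root / name
        if doc.exists():
            sections.append(f"## Reference: {name}\n{doc.read_text()}")
    grounding = "\n\n".join(sections) or "(no reference docs found)"
    return (
        f"You are the {domain} reviewer. Judge the plan ONLY against "
        f"the references below.\n\n{grounding}\n\n"
        f"Plan under review:\n{task}"
    )

# One prompt per domain, each fanned out to its own subagent.
prompts = {d: build_subagent_prompt(d, "Add a caching layer", Path("docs"))
           for d in DOMAIN_DOCS}
```

The point of the structure is that each subagent sees only its own slice of the knowledge base, which is what makes the system "knowledge-bound" rather than free-associating.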
2. Phased Execution with Review Gates: Coding proceeds in phases. After each phase, code is committed. A code review agent then runs again on each commit, creating a continuous feedback loop. This mimics a rigorous CI/CD pipeline but is applied to the AI's incremental output.
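The phase-commit-review loop can be modeled abstractly. The sketch below is an assumption-laden simplification: `generate` and `review` stand in for model calls, and appending to `history` stands in for a `git commit`; a real setup would shell out to `git` and a review agent.

```python
def phased_execution(phases, generate, review):
    """Run phases sequentially. Each phase's output is 'committed'
    (recorded in history) and must pass review before the next
    phase starts, mimicking a CI-style gate on incremental output."""
    history = []
    for phase in phases:
        patch = generate(phase)          # AI writes code for this phase only
        history.append((phase, patch))   # stand-in for a git commit
        report = review(patch)           # review agent critiques the commit
        if report != "pass":
            return history, f"stopped at {phase}: {report}"
    return history, "all phases passed"

# Stub agents for illustration; the review rejects the final phase.
commits, status = phased_execution(
    ["schema", "api", "ui"],
    generate=lambda p: f"code for {p}",
    review=lambda patch: "pass" if "ui" not in patch else "needs refactor",
)
```

The key design choice is that a failed review halts the pipeline with the commit history intact, so a human can inspect exactly which increment went wrong.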
3. Intentional Context Limitation: Despite access to massive 1M-token context windows, the engineer deliberately limits context to under 25%, calling large windows a "noob trap." The rationale is focus: overloading the model with the entire codebase can lead to distraction, increased cost, and slower performance. Strategic, relevant context is more effective than brute-force inclusion.
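A curated-context policy like the one described might look like the following sketch. The relevance scoring (raw term counts) and token estimate (word count) are deliberately crude placeholders; a production setup would use a retriever or embeddings and a real tokenizer.

```python
def curate_context(candidates, query_terms, window_tokens, budget_frac=0.25):
    """Select the most relevant snippets until we hit a fraction of
    the context window, instead of dumping the whole repo in."""
    budget = int(window_tokens * budget_frac)
    # Crude relevance: count of query-term hits per snippet.
    ranked = sorted(
        candidates,
        key=lambda c: sum(c["text"].count(t) for t in query_terms),
        reverse=True,
    )
    picked, used = [], 0
    for c in ranked:
        cost = len(c["text"].split())  # rough token estimate
        if used + cost > budget:
            break                      # stop at the budget, drop the rest
        picked.append(c["path"])
        used += cost
    return picked, used

# Toy repo: only the query-relevant files fit under the 25% budget.
picked, used = curate_context(
    [{"path": "db.py", "text": "pool query index query"},
     {"path": "ui.tsx", "text": "button render layout click hover"},
     {"path": "cache.py", "text": "query cache ttl"}],
    query_terms=["query"], window_tokens=40, budget_frac=0.25,
)
```

The hard budget cap is what enforces "under 25%" mechanically, rather than relying on discipline at prompt-writing time.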
Model Behavior: Claude vs. Codex
The report highlights a fundamental difference in the models' operational temperaments.
Claude Code was described as feeling "like a senior dev on a deadline." Its tendencies included:
- Rushing to ship patches instead of undertaking necessary refactors.
- Spewing helper functions as a superficial fix when the underlying architectural issue was deeper.
- Ignoring provided documentation (CLAUDE.md) in "almost every single session."
- Moving "too fast" and requiring significant "babysitting" even within the guardrailed workflow.
OpenAI's Codex, by contrast, presented a "different vibe entirely"—slower and more deliberate. The implication is that its pace may lead to more considered, architecturally coherent suggestions, though this assessment rests on a smaller sample of 20 hours.
The Central Takeaway: Amplification, Not Replacement
The engineer's conclusion cuts through the hype: "These tools don't replace engineering judgment..they amplify it good or bad." The most valuable asset isn't the AI model itself, but the engineered system around it:
"Your CLAUDE.md, your architecture docs, your review loops—that's the actual product...the AI is just execution."
This positions the AI as a powerful, but dumb, executor. Its output quality is directly proportional to the quality and specificity of the constraints, context, and processes provided by the human engineer.
What This Means for Engineering Teams

For teams integrating AI assistants, this report suggests several actionable insights:
- Invest in Guardrail Infrastructure: The ROI is not just in the AI tool license, but in the time spent creating and maintaining detailed, project-specific grounding documents and review protocols.
- Design for Oversight: Workflows must be designed assuming the AI will occasionally ignore instructions or take shortcuts. Automated review checkpoints are non-optional.
- Context is a Strategic Tool: Throwing the entire repository into context may be counterproductive. Curated context is a key engineering lever.
- Model Temperament Matters: Different models may suit different tasks—rapid prototyping vs. deliberate refactoring—and this should be part of the tool selection criteria.
Agentic.news Analysis
This firsthand account validates a trend we've tracked since the widespread adoption of GitHub Copilot: the shift from tool evaluation to workflow engineering. Early debates centered on benchmark performance (e.g., HumanEval pass rates). Now, as this report shows, the critical differentiator for professional use is how the tool is integrated into a disciplined development process. This aligns with our previous coverage of Cursor's "agentic" workflow modes and Windsurf's focus on the entire code review cycle, not just generation.
The engineer's criticism of Claude Code's "rushing" behavior touches on a fundamental tension in LLM design for coding: the balance between completion speed and reasoning depth. Models optimized for quick, token-by-token prediction may inherently struggle with the slow, backtracking reasoning required for deep architectural changes. This observation provides real-world corroboration for research into speculative decoding and chain-of-thought techniques that explicitly separate planning from execution, a direction both Anthropic and OpenAI are actively pursuing.
Finally, the report's pragmatic dismissal of the 1M-context window as a "noob trap" is a significant reality check for the industry. It underscores that for core engineering tasks, retrieval precision and relevance often trump raw context length. This supports the continued importance of RAG (Retrieval-Augmented Generation) systems in professional tools, even as native context windows grow. The "product" is indeed the curated knowledge base (postgres_performance.md), not just the model that reads it.
Frequently Asked Questions
What is the main difference between Claude Code and Codex according to this engineer?
The principal engineer reported that Claude Code often behaves like a senior developer under deadline pressure, favoring quick patches and helper functions over deep refactoring, and sometimes ignores provided context. Codex, in contrast, was perceived as slower and more deliberate in its approach, potentially leading to more architecturally sound suggestions.
Why does the engineer limit context window usage to under 25%?
The engineer calls large context windows a "noob trap," arguing that overloading the model with the entire codebase can reduce focus, increase costs, and slow performance. The practice suggests that strategically curated, relevant context is more effective for maintaining output quality and efficiency than providing maximum possible context.
What does "your guardrails are the actual product" mean?
This means the value delivered by an AI coding assistant is primarily determined by the quality of the human-engineered system around it—the detailed architecture documents (CLAUDE.md), domain-specific reference guides, and automated review loops. The AI model itself is merely an execution engine; its output is only as good as the constraints and guidance provided by these guardrails.
How can teams implement the "plan mode" and subagent review workflow?
The full multi-subagent setup described in the report is custom, but teams can approximate it with existing assistants by working in distinct phases. First, use the AI in a chat or planning mode to analyze requirements and generate a specification grounded in your project's docs. Then, have it generate code against that plan in small, commit-sized chunks. Finally, run a separate AI review step (or a different model) to critique each commit against your coding standards and architecture documents before integration.
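The plan/execute/review split in this FAQ answer can be wired up against any chat-style model API. In this sketch, `call_model(system, user) -> str` is an assumed wrapper around whatever API a team uses; the fake model below stands in so the example runs offline, and the role prompts are illustrative only.

```python
def three_stage_workflow(requirements, call_model, standards_doc):
    """Plan, implement, and review as three separate model calls,
    each with a narrow role and the project docs as grounding."""
    plan = call_model(
        "You are a planner. Produce a phased spec grounded in the docs.",
        f"Docs:\n{standards_doc}\n\nRequirements:\n{requirements}",
    )
    code = call_model(
        "You are an implementer. Write code for the next phase only.",
        plan,
    )
    critique = call_model(
        "You are a reviewer. Critique the code against the standards.",
        f"Standards:\n{standards_doc}\n\nCode:\n{code}",
    )
    return plan, code, critique

# Fake model so the sketch runs without an API: it echoes its role.
fake_model = lambda system, user: f"({system.split()[3]} output)"
plan, code, critique = three_stage_workflow("add search", fake_model, "PEP 8")
```

Using a different model (or at least a fresh conversation) for the review call reduces the chance that the reviewer simply rubber-stamps its own earlier output.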