Multi-Agent Coding Systems Compared: Claude Code, Codex, and Cursor
What Happened
A detailed, hands-on comparison of three leading multi-agent coding systems—Claude Code 4.6, Codex 5.3, and Cursor 2.4—reveals fundamentally different mental models of what "multi-agent" means in practice. The experiment tested each system across three distinct tasks designed to stress different capabilities: structured workflow with formatted artifacts, parallel research with quality assurance, and role-based coordination with verification.
Technical Details: Three Philosophies of Multi-Agent Execution
Claude Code: Architectural Distinction Between Subagents and Agent Teams
Claude Code is the only system that formally distinguishes between two modes of multi-agent operation, and this distinction is architecturally meaningful rather than just semantic. In the "subagent" model, an orchestrator dispatches specialized agents with fully specified constraints (e.g., "you are NOT allowed to read main.py") and assembles results. In the "agent teams" model, agents operate with genuine lateral communication—a QA agent placed on explicit standby only unblocks once researchers have delivered their findings.
This dual-primitive approach signals Anthropic's careful consideration of when parallelism alone suffices versus when genuine inter-agent coordination is necessary. In testing, Claude Code demonstrated clear orchestration behavior with simultaneous agent launches and explicit wait-and-unblock patterns, though evidence of parallelism lives in conversation traces rather than independently verifiable artifacts.
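Claude Code does not expose its internals, but the wait-and-unblock pattern described above can be sketched in plain Python: a QA agent sits on standby behind an event and only proceeds once the researcher agents have delivered. The agent bodies here are placeholders, not Claude Code's actual implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

findings: list[str] = []
findings_ready = threading.Event()  # the QA agent blocks on this

def researcher(topic: str) -> None:
    # Stand-in for a real research agent; just records a finding.
    findings.append(f"findings on {topic}")

def qa_agent() -> str:
    # Explicit standby: unblocks only once researchers have delivered.
    findings_ready.wait()
    return f"QA reviewed {len(findings)} findings"

with ThreadPoolExecutor(max_workers=4) as pool:
    qa_future = pool.submit(qa_agent)  # QA launched first, on standby
    research = [pool.submit(researcher, t) for t in ("pricing", "benchmarks")]
    for f in research:
        f.result()            # wait for all researchers to finish
    findings_ready.set()      # unblock QA
    report = qa_future.result()
```

The design point is that the QA agent is genuinely running and waiting, not merely scheduled after the researchers, which is what distinguishes an agent team from a sequential pipeline dressed up as one.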
Codex: Multi-Agent as a Software Engineering Problem
Codex implements multi-agent through a native subagent system enabled by default, but with a crucial constraint: without explicit request, it runs as a single sequential agent. When activated, Codex designs Python classes (e.g., CatalogScoutAgent, ModelSpecialistAgent) and uses ThreadPoolExecutor to run them. However, these agents are objects within a Python program that Codex writes and delivers as a repository artifact.
A significant nuance emerged: Codex often reframes research problems as engineering problems. When asked for agents doing web research across different sources simultaneously, Codex delivered parallel dictionary lookups over a single pre-fetched dataset—solving the engineering problem excellently while performing lighter work than the prompt suggested. Codex also demonstrates superior error transparency, explicitly documenting what it could not access (e.g., JS-rendered pages) rather than hiding limitations.
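The shape of what Codex delivers can be sketched as follows. The class names (CatalogScoutAgent, ModelSpecialistAgent) come from the article; their bodies and the catalog data are illustrative assumptions. Note how both agents operate on a single pre-fetched dictionary, which is exactly the "lighter work than the prompt suggested" pattern described above.

```python
from concurrent.futures import ThreadPoolExecutor

class CatalogScoutAgent:
    """Hypothetical body: scans a pre-fetched catalog for active entries."""
    def run(self, catalog: dict) -> list[str]:
        return [name for name, meta in catalog.items() if meta.get("active")]

class ModelSpecialistAgent:
    """Hypothetical body: extracts a spec field for each entry."""
    def run(self, catalog: dict) -> dict:
        return {name: meta.get("params", "?") for name, meta in catalog.items()}

# A single pre-fetched dataset; no live web access happens in the "agents".
catalog = {
    "model-a": {"active": True, "params": "7B"},
    "model-b": {"active": False, "params": "70B"},
}

with ThreadPoolExecutor(max_workers=2) as pool:
    scout = pool.submit(CatalogScoutAgent().run, catalog)
    spec = pool.submit(ModelSpecialistAgent().run, catalog)
    results = {"scout": scout.result(), "specs": spec.result()}
```

The ThreadPoolExecutor makes the program technically parallel, but since the expensive I/O (fetching the catalog) happened before the pool was created, the parallelism applies only to cheap dictionary lookups.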
Cursor: Multi-Agent as Parallel File-System Operations
Cursor implements the most literal interpretation of parallelism: separate agents, separate files, separate outputs. In testing, Agent 1 produced kimi_official_pricing.json, Agent 2 produced kimi_third_party_providers.json, and the QA agent produced both quality-check reports and summaries as separate files.
Cursor provides the most verifiable evidence of parallel execution through independently authored file artifacts with different content—a gold standard for verification that cannot be faked by sequential systems. The system also uses files as coordination mechanisms (e.g., API_CONTRACT.md at project root) that serve as both coordination tools and living documentation. However, Cursor offers limited error commentary, with execution difficulties remaining invisible in clean, structured output files.
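The file-as-artifact approach can be sketched in a few lines: each agent authors its own output file, and verification amounts to checking that distinct files with distinct content exist on disk. File names mirror the ones from the test run; the payloads and the `agent` function are illustrative assumptions.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def agent(out_path: Path, payload: dict) -> None:
    # Each agent writes its own independently authored artifact.
    out_path.write_text(json.dumps(payload, indent=2))

workdir = Path(tempfile.mkdtemp())
jobs = {
    workdir / "kimi_official_pricing.json": {"source": "official", "price": 1.0},
    workdir / "kimi_third_party_providers.json": {"source": "third-party", "price": 0.8},
}

with ThreadPoolExecutor() as pool:
    for path, payload in jobs.items():
        pool.submit(agent, path, payload)
# Exiting the context manager waits for all agents to finish writing.

# Verification step: collect the artifacts and confirm they differ.
artifacts = {p.name: json.loads(p.read_text()) for p in workdir.glob("*.json")}
```

The appeal for auditing is that the artifacts outlive the conversation: anyone can inspect the files afterward, which is not true of parallelism claimed only in a chat trace.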
Key Findings: Real vs. Performed Parallelism
Execution Evidence
- Claude Code: Parallel execution evidenced by orchestration behavior (simultaneous launches, wait-and-unblock patterns), though this evidence rests on the system's self-report
- Codex: Parallelism coded into artifacts (ThreadPoolExecutor) but often sequential in practice, with I/O operations completed before "parallel" work begins
- Cursor: Parallel execution verified through independently authored file artifacts—the most objectively verifiable approach
Output Characteristics
- Claude Code: Produces polished, confident outputs with rich synthesis but buries source access limitations
- Codex: Explicit about access failures and limitations, with detailed scope notes and warnings sections
- Cursor: Delivers clean, structured file artifacts but offers minimal narrative about execution difficulties
Engineering vs. Research Focus
- Claude Code: Casts the widest research net with the richest synthesis capabilities
- Codex: Excels at engineering solutions, reframing problems as software challenges
- Cursor: Focuses on deliverable file quality with verifiable parallel execution
Retail & Luxury Implications
While this comparison focuses on coding systems, the underlying multi-agent paradigms have significant implications for retail and luxury AI applications:
Supply Chain and Inventory Management
The different multi-agent approaches map directly to complex retail challenges. Claude Code's agent teams model could coordinate real-time inventory tracking across warehouses, stores, and suppliers with genuine lateral communication. Codex's engineering-focused approach might excel at optimizing logistics algorithms and route planning. Cursor's file-based parallelism could handle simultaneous price monitoring across multiple competitors and regions, with each agent producing independently verifiable data files.
Customer Service and Personalization
Multi-agent systems could revolutionize customer interactions by orchestrating specialized agents for different aspects of the luxury experience:
- Product specialists accessing inventory databases
- Style advisors referencing customer preference histories
- Pricing agents checking promotions and loyalty benefits
- Coordination agents ensuring consistent messaging across touchpoints
Claude Code's distinction between subagents and teams becomes particularly relevant here—simple parallel queries (subagents) versus complex, coordinated customer journeys (agent teams).
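The subagent side of that distinction maps naturally onto a fan-out query: independent specialists answer in parallel and a coordinator assembles the responses. The sketch below is speculative retail pseudocode, not any vendor's API; all agent names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical specialist agents; in a real deployment each would query
# inventory, CRM, and pricing systems rather than return canned strings.
def product_specialist(sku: str) -> str:
    return f"{sku}: 3 units in flagship store"

def style_advisor(customer_id: str) -> str:
    return f"{customer_id}: prefers monochrome"

def pricing_agent(sku: str) -> str:
    return f"{sku}: loyalty discount 10%"

def serve_customer(customer_id: str, sku: str) -> list[str]:
    # Subagent-style dispatch: independent parallel queries, no lateral
    # communication, results assembled by the coordinator at the end.
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(product_specialist, sku),
            pool.submit(style_advisor, customer_id),
            pool.submit(pricing_agent, sku),
        ]
        return [f.result() for f in futures]
```

A coordinated customer journey (the agent-teams side) would need more than this: the agents would have to react to each other's outputs mid-flight, for example the pricing agent waiting on the style advisor's recommendation before selecting an offer.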
Competitive Intelligence and Market Analysis
The testing methodology itself—parallel research across multiple sources with quality assurance—mirrors exactly what luxury brands need for market intelligence. Cursor's artifact-based approach provides auditable trails of competitive pricing research. Codex's transparency about data limitations ensures brands understand the reliability of intelligence reports. Claude Code's synthesis capabilities could combine pricing, social sentiment, and trend data into actionable insights.
Implementation Considerations for Retail
- Auditability Requirements: Luxury brands with strict compliance needs might prefer Cursor's file-based approach for verifiable audit trails
- Error Sensitivity: Brands requiring complete transparency about data limitations might favor Codex's explicit error documentation
- Complex Coordination: Scenarios requiring genuine agent interaction (e.g., coordinating personalized offers across channels) align with Claude Code's team model
- Integration Complexity: All systems would require significant adaptation to integrate with existing retail CRM, ERP, and e-commerce platforms
Current Limitations and Future Potential
The comparison reveals that current multi-agent systems are primarily designed for software development tasks. Adaptation to retail domains would require:
- Custom connectors to retail data sources (inventory, CRM, POS systems)
- Domain-specific training on luxury product knowledge and brand guidelines
- Integration with existing business intelligence and analytics platforms
- Robust security and privacy controls for customer data
However, the fundamental paradigms—parallel execution, specialized agents, coordination mechanisms—provide a blueprint for how AI could transform retail operations. As these systems mature, we can expect more domain-specific implementations that address the unique needs of luxury retail, from personalized clienteling at scale to dynamic pricing optimization across global markets.