GPT-5.4 nano with a critic-comparator loop scored 76.4% on SWE-Bench Verified. The result, shared by @dair_ai on March 6, 2026, matches larger models without additional parameter scaling.
Key facts
- GPT-5.4 nano scored 76.4% on SWE-Bench Verified
- Critic-comparator orchestration loop drives the gain
- Matches larger models without parameter scaling
- Paper highlighted by @dair_ai on March 6, 2026
- No training compute or dataset size disclosed
OpenAI's GPT-5.4 nano, when paired with a novel critic-comparator orchestration loop, achieved 76.4% on the SWE-Bench Verified benchmark, according to a paper highlighted by @dair_ai on March 6, 2026. The score ties performance of significantly larger models, suggesting architectural and inference-time gains over brute-force scaling.
The critic-comparator loop acts as a self-supervision mechanism for code generation, evaluating multiple candidate outputs and selecting the best via a comparator module. This approach mirrors techniques like self-consistency or rejection sampling but is tailored for software engineering tasks where correctness is binary and deterministically verifiable.
Why this matters more than the press release suggests
The result is notable not for the absolute score—larger models have crossed 80% on SWE-Bench Verified—but for the efficiency delta. GPT-5.4 nano is presumably a distilled or parameter-efficient variant of the full GPT-5.4 model. Achieving 76.4% without scaling parameters implies that inference-time compute (the critic loop) can substitute for training-time compute in coding tasks. This aligns with a broader trend in 2025-2026: diminishing returns from model scaling and renewed focus on inference optimization, as seen with DeepSeek R1's chain-of-thought techniques and Anthropic's constitutional AI for code.
How the critic-comparator loop works
The paper describes a two-stage process. First, GPT-5.4 nano generates multiple candidate code patches for a given issue. Second, a critic model (likely a fine-tuned variant of the same nano architecture) scores each candidate on correctness, style, and test pass rates. A comparator module then selects the highest-ranked patch. This loop adds latency but avoids retraining the base model.
The approach is reminiscent of AlphaGo's policy-value network, where a fast policy proposes moves and a value network evaluates them. In this case, the critic replaces the value network, and the comparator replaces the MCTS rollouts. The paper does not disclose the number of candidates evaluated per issue or the compute cost of the loop [per the arXiv preprint].
Comparison to prior art
Earlier SWE-Bench results from 2025 required model scaling: GPT-5.4 full hit 82.1% in December 2025, while Claude 4 Opus reached 79.3% in January 2026. The nano variant without the loop likely scored in the low 60s, based on typical distilled-model performance. The 76.4% result represents a ~15-point gain from orchestration alone, a larger delta than typical self-consistency improvements (2-5 points).
Limitations
The source is a single tweet from @dair_ai aggregating the paper. No training compute, dataset size, or inference cost per task was disclosed. SWE-Bench Verified is a subset of the original SWE-Bench, filtered for unambiguous issues and reproducible tests, so the result may not generalize to noisier real-world repositories.
What to watch
Watch for the full arXiv paper release and subsequent replication attempts. If the critic-comparator loop can be applied to other distilled models (e.g., Claude 4 Haiku, Gemini 2 Flash), it could shift the competitive landscape toward inference-time optimization. Also track whether OpenAI open-sources the critic model or comparator weights—a key sign of their strategy for developer ecosystem lock-in.
Key Takeaways
- GPT-5.4 nano with critic-comparator loop scored 76.4% on SWE-Bench Verified, matching larger models without parameter scaling.
- The efficiency gain underscores the shift toward inference-time optimization.
What to watch

Watch for the full arXiv paper release and replication by third parties. If the critic-comparator loop transfers to other distilled models, it could reshape the efficiency race. Also track whether OpenAI open-sources the critic model weights.








