OpenAI GPT-5.5: 82.7% Terminal-Bench, 35.4% FrontierMath Tier 4

GPT-5.5 scores 82.7% on Terminal-Bench 2.0 and 35.4% on FrontierMath Tier 4, while reducing token usage per task. The model is rolling out to paid ChatGPT tiers now, with API access to follow soon, priced at $5/$30 per million input/output tokens.

GAla Smith & AI Research Desk · 4h ago · 6 min read · AI-Generated

Source: x.com via @kimmonismus (corroborated)

OpenAI GPT-5.5: Efficiency and Intelligence Leap in New Frontier Model

OpenAI has released GPT-5.5, its latest frontier model, delivering significant improvements across coding, mathematics, scientific reasoning, and long-context recall while matching the latency of GPT-5.4. The model is already rolling out to ChatGPT Plus, Pro, Business, and Enterprise tiers, with API access coming shortly.

Key Results

The model achieves substantial gains over GPT-5.4 and Anthropic's Claude Opus 4.7 across multiple benchmarks:

Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7
Terminal-Bench 2.0 (agentic coding) | 82.7% | 75.1% | 69.4%
SWE-Bench Pro | 58.6% | n/a | 64.3%
Expert-SWE (internal, median human 20h) | 73.1% | n/a | n/a
GDPval (knowledge work) | 84.9% | n/a | 80.3%
OSWorld-Verified | 78.7% | n/a | 78.0%
Tau2-bench Telecom (no prompt tuning) | 98.0% | n/a | n/a
GeneBench (science) | 25.0% | 19.0% | n/a
BixBench | 80.5% | n/a | n/a
FrontierMath Tier 4 | 35.4% | 27.1% | 22.9%
FrontierMath Tiers 1–3 | 51.7% | n/a | n/a
MRCR 512K–1M tokens (long context) | 74.0% | 36.6% | n/a
CyberGym (cybersecurity) | 81.8% | n/a | 73.1%
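To make the size of these gaps easier to read, the short sketch below computes the percentage-point difference and the ratio between GPT-5.5 and each comparison model. The scores are transcribed from the table above for the rows that have at least one comparison value; the function names `compare` and `report` are illustrative and do not come from OpenAI or any benchmark tooling.

```python
# Scores in percent, transcribed from the table above.
# Tuple order: (GPT-5.5, GPT-5.4, Claude Opus 4.7); None means not reported.
SCORES = {
    "Terminal-Bench 2.0": (82.7, 75.1, 69.4),
    "SWE-Bench Pro": (58.6, None, 64.3),
    "GDPval": (84.9, None, 80.3),
    "OSWorld-Verified": (78.7, None, 78.0),
    "GeneBench": (25.0, 19.0, None),
    "FrontierMath Tier 4": (35.4, 27.1, 22.9),
    "MRCR 512K-1M": (74.0, 36.6, None),
    "CyberGym": (81.8, None, 73.1),
}


def compare(new, old):
    """Return (percentage-point delta, ratio new/old), or None if a score is missing."""
    if new is None or old is None:
        return None
    return round(new - old, 1), round(new / old, 2)


def report(scores=SCORES):
    """Compare GPT-5.5 against GPT-5.4 and Claude Opus 4.7 for every benchmark."""
    return {
        name: {
            "vs_gpt54": compare(gpt55, gpt54),
            "vs_opus47": compare(gpt55, opus47),
        }
        for name, (gpt55, gpt54, opus47) in scores.items()
    }


if __name__ == "__main__":
    for name, row in report().items():
        print(f"{name}: {row}")
```

On these numbers, the MRCR result is the largest relative gain over GPT-5.4 (about 2x), while SWE-Bench Pro is the one row where the delta against Claude Opus 4.7 is negative.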

What's New

GPT-5.5 is OpenAI's "most capable model yet," designed to match GPT-5.4's latency while delivering higher intelligence across coding, knowledge work, and scientific research. The standout story is efficiency: GPT-5.5 uses fewer tokens to complete the same tasks, making it both smarter and cheaper to run per task.

OpenAI claims the model costs half as much as competitive frontier coding models on Artificial Analysis's Coding Index, with a 20%+ token generation speed boost from AI-optimized load balancing. The infrastructure is co-designed for and served on NVIDIA GB200 and GB300 NVL72 systems.

How It Works

While OpenAI has not released full architectural details, the model's efficiency gains appear to stem from a combination of:

Token efficiency: The model completes tasks with fewer generated tokens, reducing cost per task
AI-optimized serving: Codex was used to help optimize the serving infrastructure that GPT-5.5 itself runs on, a form of meta-optimization applied to inference
NVIDIA GB200/GB300 NVL72: Co-designed hardware-software integration for optimal throughput
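As a rough illustration of how the first two factors combine, the sketch below models generation time as output tokens divided by throughput. The 1.20 speedup is the article's "20%+" figure taken at its lower bound; the 0.80 token ratio is a purely hypothetical value, since the release does not say how many fewer tokens GPT-5.5 uses. Function names are illustrative.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate a response at a fixed throughput."""
    return output_tokens / tokens_per_second


def relative_time(token_ratio: float, speedup: float) -> float:
    """Fraction of the baseline generation time that remains.

    token_ratio: new output tokens / old output tokens for the same task
    speedup:     new throughput / old throughput
    """
    return token_ratio / speedup


if __name__ == "__main__":
    speedup = 1.20  # article's "20%+" figure, lower bound
    hypothetical_token_ratio = 0.80  # assumption: not a published number

    faster_only = 1 - relative_time(1.0, speedup)
    combined = 1 - relative_time(hypothetical_token_ratio, speedup)
    print(f"time saved from speedup alone: {faster_only:.1%}")
    print(f"time saved with fewer tokens too: {combined:.1%}")
```

Note that a 20% throughput gain on its own shortens generation time by about 16.7%, not 20%, because time scales with the inverse of throughput. Per-task cost, by contrast, depends only on the token count at a fixed price, so the token ratio carries over to cost directly.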

OpenAI also uses Codex, its coding agent, in the model's development and deployment pipeline. According to the company, 85% of OpenAI employees already use Codex weekly across all departments.

Surprising Bits

Claude Opus 4.7 still leads on SWE-Bench Pro (64.3% vs. 58.6%) and some long-context benchmarks
GPT-5.5 used Codex to help optimize its own serving infrastructure, a meta-optimization loop
Internal adoption: 85% of OpenAI employees use Codex weekly across all departments
Mathematical discovery: Helped discover a new proof about Ramsey numbers verified in Lean

Pricing and Availability

Tier | Input (per 1M tokens) | Output (per 1M tokens)
Standard API | $5 | $30
Pro API | $30 | $180

Availability: Rolling out to Plus, Pro, Business, Enterprise in ChatGPT and Codex. API access "very soon."
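For a sense of what these rates mean per request, here is a minimal cost calculator using the listed prices. The token counts in the usage example are hypothetical, and the function name `request_cost` is illustrative rather than part of any OpenAI SDK. Actual billing may include factors (such as cached-input discounts) that the article does not cover.

```python
from decimal import Decimal

# (input price, output price) in USD per 1M tokens, as listed in the article.
PRICES_PER_MILLION = {
    "standard": (Decimal("5"), Decimal("30")),
    "pro": (Decimal("30"), Decimal("180")),
}


def request_cost(input_tokens: int, output_tokens: int, tier: str = "standard") -> Decimal:
    """Return the USD cost of one request at the listed per-million-token rates."""
    in_price, out_price = PRICES_PER_MILLION[tier]
    million = Decimal(1_000_000)
    return (in_price * input_tokens + out_price * output_tokens) / million


if __name__ == "__main__":
    # Hypothetical request: a 200K-token prompt with a 10K-token response.
    for tier in PRICES_PER_MILLION:
        print(tier, request_cost(200_000, 10_000, tier))
```

With that hypothetical request, the standard tier comes to $1.30 and the Pro tier to $7.80. `Decimal` is used instead of floats so the currency arithmetic stays exact.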

gentic.news Analysis

GPT-5.5 represents a notable shift in OpenAI's strategy: rather than chasing raw benchmark supremacy on every metric, the company is optimizing for efficiency and real-world task completion. The 74.0% on MRCR at 512K–1M tokens (up from 36.6% for GPT-5.4) is a roughly twofold jump. OpenAI has not disclosed the cause; changes to attention mechanisms or context compression are plausible but unconfirmed. This follows a trend we've seen across the industry: Anthropic's Claude 3.5 Sonnet also showed strong long-context performance, and Google's Gemini 1.5 Pro shipped with context windows of up to 2M tokens and was tested at up to 10M tokens in research.

The use of Codex to optimize the serving infrastructure GPT-5.5 runs on is a fascinating meta-engineering move. This mirrors the trend we covered in our article on "Self-Improving AI Systems," where models are increasingly used to optimize their own training and deployment pipelines. However, Claude Opus 4.7 still leads on SWE-Bench Pro by 5.7 points (64.3% vs. 58.6%), which suggests that Anthropic remains competitive in agentic coding tasks. The gap may be narrowing rather than widening, though the release includes no earlier scores to confirm that.

The FrontierMath Tier 4 result (35.4%) is particularly striking. Tier 4 problems are designed to be extremely difficult, requiring novel mathematical reasoning rather than pattern matching. The score is consistent with reasoning that goes beyond retrieval and pattern completion, although a single benchmark cannot establish that on its own. The Lean-verified proof about Ramsey numbers is a concrete supporting example.

Frequently Asked Questions

When will GPT-5.5 be available via API?

OpenAI states API access will come "very soon" but has not given a date. The model is currently rolling out to ChatGPT Plus, Pro, Business, and Enterprise tiers. Based on previous release patterns, API availability within weeks is a reasonable expectation, not a commitment.

How does GPT-5.5 compare to Claude Opus 4.7?

GPT-5.5 leads on many benchmarks (Terminal-Bench 82.7% vs 69.4%, GDPval 84.9% vs 80.3%, FrontierMath Tier 4 35.4% vs 22.9%) but Claude Opus 4.7 still leads on SWE-Bench Pro (64.3% vs 58.6%) and some long-context evaluations. The models are competitive across different task categories.

What is the pricing for GPT-5.5?

Standard API pricing is $5 per million input tokens and $30 per million output tokens. A Pro tier is available at $30/$180 respectively. OpenAI claims this is half the cost of competitive frontier coding models on Artificial Analysis's Coding Index.

What hardware does GPT-5.5 run on?

GPT-5.5 is co-designed for and served on NVIDIA GB200 and GB300 NVL72 systems. The infrastructure includes AI-optimized load balancing that provides a 20%+ token generation speed boost.

Is GPT-5.5 better than GPT-5.4 at everything?

Not demonstrably. GPT-5.5 beats GPT-5.4 on every benchmark where both scores were published (Terminal-Bench 2.0, GeneBench, FrontierMath Tier 4, and MRCR long-context recall), but OpenAI has not released GPT-5.4 comparisons for the other tasks. Separately, Claude Opus 4.7 still leads on SWE-Bench Pro, so GPT-5.5 is not the top performer in every domain.

AI Analysis

The 74.0% on MRCR at 512K–1M tokens (up from 36.6% for GPT-5.4) is a roughly 2x improvement in long-context recall. This is the most technically significant result in the release. It suggests OpenAI has substantially mitigated the "lost in the middle" problem that has plagued transformer-based models, though one benchmark does not show the problem is solved. For practitioners building RAG systems or working with large codebases, this could reduce the need for chunking strategies and retrieval pipelines.

The FrontierMath Tier 4 result (35.4% vs. 27.1% for GPT-5.4) is also noteworthy. FrontierMath problems are designed to be extremely challenging, requiring novel mathematical reasoning rather than pattern matching. The Lean-verified proof about Ramsey numbers is a concrete example of that capability. This aligns with the broader trend of models moving from pattern completion toward multi-step reasoning, as seen in DeepSeek-R1's chain-of-thought improvements.

The efficiency claim, half the cost of competitive models, is harder to verify without independent testing. But the token efficiency improvement (fewer tokens per task) combined with the 20%+ speed boost from optimized load balancing suggests real cost savings for heavy users. The use of Codex to optimize the serving infrastructure GPT-5.5 runs on is a meta-optimization loop that could accelerate future deployments.
