New research reveals a critical flaw in how developers evaluate reasoning model costs: listed API prices can be dramatically misleading, with the supposedly cheaper model often costing significantly more in practice. The study, analyzing 8 frontier reasoning models across 9 tasks, found that 21.8% of model-pair comparisons exhibit "pricing reversal"—where the model with the lower listed price actually incurs higher costs when deployed.
The magnitude of these reversals reaches up to 28x, fundamentally challenging how teams budget for and select reasoning models in production systems.
The Pricing Reversal Phenomenon
The research paper presents concrete examples that upend conventional wisdom about model economics:
- Gemini 3 Flash is listed as 78% cheaper than GPT-5.2 ($0.10 vs. $0.45 per million output tokens), yet its actual cost per task is 22% higher in practice
- Claude Opus 4.6 is listed at 2x the price of Gemini 3.1 Pro, but actually costs 35% less when deployed
- Across 72 model-pair comparisons, 21.8% exhibited such reversals
These findings suggest that teams relying solely on published API pricing tables are making economically suboptimal decisions approximately one-fifth of the time.
Root Cause: Thinking Token Heterogeneity
The primary driver of these cost discrepancies is what researchers term "thinking token heterogeneity"—the dramatic variation in how many internal reasoning tokens different models use to solve the same problem.
On identical queries, one model may consume 900% more thinking tokens than another. Since most providers charge for these internal computation steps (though sometimes at different rates than regular tokens), this creates massive cost variations that aren't captured by simple per-token pricing comparisons.
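The arithmetic behind a reversal is simple to sketch. The listed prices below are the article's figures for Gemini 3 Flash and GPT-5.2; the per-task token counts are illustrative assumptions (not from the study), chosen so the cheaper-listed model "thinks" far more:

```python
# Sketch: how thinking-token volume can reverse a listed-price ordering.
# Listed $/1M-output-token prices are from the article; the per-task token
# counts are hypothetical, for illustration only.

PRICE_PER_M = {"model_a": 0.10, "model_b": 0.45}  # listed $ per 1M output tokens

# Hypothetical per-task usage: the cheaper-listed model thinks ~6.7x more.
TOKENS_PER_TASK = {
    "model_a": {"output": 2_000, "thinking": 60_000},
    "model_b": {"output": 2_000, "thinking": 9_000},
}

def actual_cost(model: str) -> float:
    """Billed cost per task, assuming thinking tokens are charged at the
    output-token rate (actual rates vary by provider)."""
    usage = TOKENS_PER_TASK[model]
    billable = usage["output"] + usage["thinking"]
    return billable * PRICE_PER_M[model] / 1_000_000

for m in PRICE_PER_M:
    print(m, f"${actual_cost(m):.4f}")
```

Under these assumed token profiles, the model listed at $0.10 ends up costing more per task ($0.0062 vs. $0.00495) than the one listed at $0.45, which is the reversal pattern the study documents.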
Practical Implications for Production Systems
The research team provides actionable guidance for developers:
- Benchmark actual costs, not listed prices: the authors release code and data for per-task cost auditing, enabling teams to measure true deployment economics
- Consider thinking token efficiency: models with higher per-token prices may be more economical if they use significantly fewer thinking tokens
- Note the weight of thinking token charges: the study found that eliminating them would reduce ranking reversals by 70%, suggesting providers could offer more transparent pricing models
- Evaluate per task type: cost efficiency varies significantly across task types, making generalized comparisons unreliable
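The per-task cost audit recommended above can be sketched in a few lines. The rate structure and usage records here are illustrative assumptions; the study's released tooling is the authoritative reference:

```python
# Minimal sketch of a per-task cost audit. Rates and usage shapes are
# assumptions for illustration; adapt to your provider's billing fields.
from dataclasses import dataclass

@dataclass
class Rates:
    input_per_m: float     # $ per 1M input tokens
    output_per_m: float    # $ per 1M output tokens
    thinking_per_m: float  # $ per 1M thinking tokens (may differ from output)

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    thinking_tokens: int

def task_cost(rates: Rates, u: Usage) -> float:
    """Total billed dollars for one task, itemized by token class."""
    return (u.input_tokens * rates.input_per_m
            + u.output_tokens * rates.output_per_m
            + u.thinking_tokens * rates.thinking_per_m) / 1_000_000

def mean_cost(rates: Rates, usages: list[Usage]) -> float:
    """Average cost over a set of representative tasks."""
    return sum(task_cost(rates, u) for u in usages) / len(usages)
```

Comparing `mean_cost` across candidate models on the same prompt set, rather than comparing rate cards, is the shift in practice the authors advocate.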
Methodology and Scope
The research evaluated 8 frontier reasoning models (including GPT-5.2, Claude Opus 4.6, Gemini 3 Flash, and Gemini 3.1 Pro) across 9 diverse reasoning tasks. The team measured both listed API prices and actual deployment costs, accounting for:
- Input token counts
- Output token counts
- Thinking/chain-of-thought token usage
- Provider-specific pricing structures
- Task completion rates and accuracy
The complete dataset and auditing tools are available in the accompanying repository, allowing teams to replicate the analysis for their specific use cases.
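Because the methodology accounts for completion rates and accuracy, raw spend can be normalized to cost per successful task. Whether the study uses exactly this normalization is an assumption on our part; it is one common way to fold accuracy into cost:

```python
# Sketch: normalizing spend by success count. A model that is cheap per
# attempt but inaccurate can be pricier per *successful* task.

def cost_per_success(total_cost: float, successes: int) -> float:
    """Average dollars spent per correctly completed task."""
    if successes == 0:
        return float("inf")
    return total_cost / successes

# Hypothetical figures: model A is cheaper per attempt, less accurate.
a = cost_per_success(total_cost=1.00, successes=50)  # 100 attempts, 50 correct
b = cost_per_success(total_cost=1.50, successes=90)  # 100 attempts, 90 correct
```

In this hypothetical, model B spends 50% more in total yet costs less per success ($0.0167 vs. $0.02), another way listed prices can mislead.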
What This Means for AI Development Teams
For engineering leaders building with reasoning models, this research necessitates a shift in procurement strategy:
- Budgeting: Development budgets based on listed prices may be off by significant margins
- Vendor selection: The "cheapest" vendor on paper may be the most expensive in production
- Performance testing: Cost benchmarking must become a standard part of model evaluation alongside accuracy and latency
- Contract negotiation: Teams should push for more transparent pricing that accounts for thinking token variability
The findings are particularly relevant as reasoning models become central to agentic systems, where thinking token consumption represents the majority of computational cost.
gentic.news Analysis
This research arrives at a critical inflection point in the reasoning model market. As we've covered extensively, 2025 saw explosive growth in agentic AI deployments, with companies like Cognition Labs and Magic.dev pushing reasoning models into production at scale. The economic implications of thinking token costs were largely theoretical until now—this study provides the first empirical evidence of their market-distorting effects.
The pricing reversals identified align with a broader trend we've observed: the decoupling of listed prices from total cost of ownership in AI infrastructure. Similar dynamics emerged in the GPU market, where cheaper upfront hardware costs were often offset by higher power consumption and maintenance expenses. What's novel here is the opacity—while electricity costs are measurable, thinking token usage has been a black box until this research.
Notably, the study's release coincides with increased regulatory scrutiny of AI pricing transparency. The EU AI Act's provisions on algorithmic transparency, which took full effect in January 2026, may create legal pressure for providers to disclose thinking token costs more clearly. This research provides the methodological foundation for such disclosures.
The 28x worst-case discrepancy is particularly striking when viewed against the backdrop of the ongoing "inference cost war" among major providers. As we reported in November 2025, Anthropic cut Claude API prices by 40%, following similar moves by OpenAI and Google. This research suggests those headline price cuts may be less meaningful than they appear if thinking token consumption patterns aren't considered.
For practitioners, the immediate takeaway is clear: model selection must now include a cost benchmarking phase that measures actual token consumption on representative workloads. The era of comparing price sheets is over.
Frequently Asked Questions
Why do thinking tokens cause such large cost discrepancies?
Thinking tokens represent the internal computation a model performs before generating a final answer. Different architectures and training approaches lead to dramatically different thinking patterns—some models "think" extensively before answering, while others generate answers more directly. Since providers charge for these computational steps (often at different rates than input/output tokens), small differences in thinking efficiency compound into large cost variations.
How can I benchmark actual costs for my specific use case?
The researchers have released code and datasets that can be adapted to specific workloads. The basic approach involves: 1) Running your representative prompts through candidate models, 2) Capturing total token usage (input + output + thinking), 3) Applying provider-specific pricing, and 4) Comparing total costs rather than listed per-token rates. For accuracy, you need sufficient sample size to account for variability in thinking token usage.
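The four steps above can be sketched as a small pipeline. `run_model` is a stub standing in for a real provider call, and the pricing table is an illustrative assumption, not any vendor's actual rates:

```python
# Sketch of the four-step benchmark. `run_model` and PRICING are
# placeholders; swap in your provider SDK and current rate card.

PRICING = {  # illustrative $ per 1M tokens, by token class
    "model_a": {"input": 0.05, "output": 0.10, "thinking": 0.10},
    "model_b": {"input": 0.15, "output": 0.45, "thinking": 0.45},
}

def run_model(model: str, prompt: str) -> dict:
    """Placeholder for a real API call; must return per-class token counts."""
    return {"input": len(prompt.split()), "output": 300, "thinking": 4_000}

def benchmark(models, prompts):
    totals = {m: 0.0 for m in models}
    for m in models:
        for p in prompts:                      # 1) run representative prompts
            usage = run_model(m, p)            # 2) capture total token usage
            rates = PRICING[m]
            totals[m] += sum(                  # 3) apply provider pricing
                usage[k] * rates[k] for k in ("input", "output", "thinking")
            ) / 1_000_000
    return totals                              # 4) compare total costs
```

Run enough prompts per model for the thinking-token variance to average out before comparing the totals.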
Are some providers more transparent about thinking token costs than others?
Currently, transparency varies significantly. Some providers itemize thinking token charges separately, while others bundle them into different pricing tiers. The research found no consistent correlation between pricing transparency and actual cost efficiency—the most transparent provider isn't necessarily the most economical for a given task. This inconsistency is why benchmarking is essential.
Will this research lead to changes in how providers price their models?
Pressure is building for more transparent pricing models. The research shows that removing thinking token costs would reduce pricing reversals by 70%, suggesting simpler pricing could benefit both providers and customers. However, thinking tokens represent real computational expense, so providers are unlikely to absorb these costs entirely. More likely outcomes include: standardized thinking token reporting, task-based pricing tiers, or efficiency guarantees for specific workload types.