A new benchmark specifically designed to test thermodynamic reasoning in large language models reveals significant performance degradation as problems move from simple property lookups to complex engineering analysis. ThermoQA, introduced in a March 2026 arXiv paper, evaluates six frontier models across 293 open-ended thermodynamics problems, with Claude Opus 4.6 leading the composite leaderboard at 94.1% accuracy.
Key Takeaways
- Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning.
- Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.
What ThermoQA Measures
ThermoQA consists of three distinct tiers designed to isolate different levels of reasoning:
- Tier 1: Property Lookups (110 questions) – Basic retrieval of thermodynamic properties (temperature, pressure, enthalpy) for given substances and states.
- Tier 2: Component Analysis (101 questions) – Analysis of individual thermodynamic components like compressors, turbines, and heat exchangers.
- Tier 3: Full Cycle Analysis (82 questions) – Complete analysis of thermodynamic cycles including combined-cycle gas turbines and refrigeration systems.
Ground truth answers are computed programmatically using CoolProp 7.2.0, covering water, R-134a refrigerant, and variable-specific-heat air. The benchmark focuses on engineering thermodynamics rather than theoretical physics, making it relevant for practical applications in energy systems, HVAC design, and mechanical engineering.
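The paper summary above does not show the benchmark's actual generation scripts, but a Tier-1-style lookup against CoolProp can be sketched with its real `PropsSI` API (guarded so the snippet degrades gracefully when CoolProp is not installed; the state point chosen here is illustrative):

```python
# Sketch of a Tier-1-style property lookup via CoolProp's PropsSI.
# The specific question and state point are made-up examples, not
# taken from the ThermoQA dataset itself.
try:
    from CoolProp.CoolProp import PropsSI
    # Specific enthalpy of superheated steam at 500 K and 1 atm, in J/kg.
    h = PropsSI("H", "T", 500.0, "P", 101325.0, "Water")
except ImportError:
    h = None  # CoolProp not available in this environment
```

Because the reference value is computed rather than hand-labeled, the same call can regenerate ground truth for any substance and state CoolProp supports, including the supercritical water states the benchmark covers.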
Key Results: Performance Degradation Reveals Reasoning Limits
The most striking finding is the significant performance drop as problems increase in complexity. While all models performed well on simple property lookups, their ability to handle multi-step thermodynamic reasoning varied dramatically.

Composite Leaderboard (Overall and Per-Tier Accuracy):

| Model | Overall | Tier 1 | Tier 2 | Tier 3 |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 94.1% | 96.4% | 93.1% | 93.6% |
| GPT-5.4 | 93.1% | 95.8% | 92.3% | 91.2% |
| Gemini 3.1 Pro | 92.5% | 95.1% | 91.8% | 90.6% |
| GPT-4o | 88.7% | 92.3% | 87.5% | 86.3% |
| Claude Sonnet 4.6 | 85.2% | 90.1% | 83.4% | 82.1% |
| MiniMax | 72.3% | 89.5% | 68.2% | 57.0% |

Cross-Tier Degradation (Percentage Point Drop from Tier 1 to Tier 3):
- Claude Opus 4.6: 2.8 pp (minimal degradation)
- GPT-5.4: 4.6 pp
- Gemini 3.1 Pro: 4.5 pp
- GPT-4o: 6.0 pp
- Claude Sonnet 4.6: 8.0 pp
- MiniMax: 32.5 pp (severe degradation)
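The degradation figures follow directly from the Tier 1 and Tier 3 columns of the leaderboard; a quick arithmetic check:

```python
# Reproduce the cross-tier degradation (percentage points) from the
# reported Tier 1 and Tier 3 accuracies in the leaderboard above.
tiers = {
    "Claude Opus 4.6": (96.4, 93.6),
    "GPT-5.4": (95.8, 91.2),
    "Gemini 3.1 Pro": (95.1, 90.6),
    "GPT-4o": (92.3, 86.3),
    "Claude Sonnet 4.6": (90.1, 82.1),
    "MiniMax": (89.5, 57.0),
}
degradation = {model: round(t1 - t3, 1) for model, (t1, t3) in tiers.items()}
```

Running this reproduces the list above, e.g. 2.8 pp for Claude Opus 4.6 and 32.5 pp for MiniMax.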
The researchers note that "property memorization does not imply thermodynamic reasoning"—a model can excel at looking up values without understanding how to apply them in engineering contexts.
Natural Discriminators and Consistency Metrics
Certain problem types revealed particularly large performance spreads:
- Supercritical water analysis: 40-60 percentage point performance spreads between models
- R-134a refrigerant cycle analysis: Similar 40-60 pp spreads
- Combined-cycle gas turbine analysis: Major discriminators with consistent large gaps
The benchmark also introduces a novel evaluation axis: reasoning consistency. By running each model three times on the same problems, the researchers measured multi-run sigma (standard deviation) ranging from ±0.1% (Claude Opus) to ±2.5% (MiniMax). This quantifies how consistently a model arrives at correct answers, separate from raw accuracy.
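The sigma metric is a plain sample standard deviation over repeated runs. A minimal sketch with hypothetical per-run scores (the paper reports only the resulting sigma, not the per-run numbers):

```python
import statistics

# Hypothetical three-run accuracy scores (%) for one model; the actual
# per-run numbers are not published, only the resulting sigma.
runs = [94.0, 94.1, 94.2]

# Sample standard deviation across the three runs -> the "multi-run sigma".
sigma = statistics.stdev(runs)
```

For these illustrative runs, sigma comes out to 0.1 percentage points, the consistency level reported for Claude Opus; a model that flip-flops between runs would show a correspondingly larger sigma at the same mean accuracy.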
How ThermoQA Works: Programmatic Ground Truth
Unlike benchmarks that rely on human-labeled answers, ThermoQA generates ground truth programmatically using CoolProp, an open-source thermodynamic property database. This eliminates human error and ensures precise numerical accuracy for engineering applications. The benchmark includes:

- 293 open-ended questions requiring numerical answers with units
- Three working fluids: water (including supercritical states), R-134a refrigerant, and air with variable specific heat
- Real engineering scenarios: Rankine cycles, refrigeration cycles, Brayton cycles, and combined cycles
- Progressive difficulty: From single-property lookup to multi-component system analysis
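A Tier-3-style question reduces to chaining component energy balances. As an illustration, the thermal efficiency of an ideal Rankine cycle from four state enthalpies (the values below are made-up examples, not drawn from the benchmark):

```python
# Illustrative full-cycle calculation: ideal Rankine cycle thermal
# efficiency from four state enthalpies (kJ/kg). Values are invented
# for demonstration only.
h1 = 191.8    # saturated liquid leaving the condenser
h2 = 195.0    # compressed liquid after the pump
h3 = 3399.5   # superheated steam leaving the boiler
h4 = 2200.0   # steam after isentropic expansion in the turbine

w_turbine = h3 - h4          # specific turbine work
w_pump = h2 - h1             # specific pump work input
q_in = h3 - h2               # heat added in the boiler
eta = (w_turbine - w_pump) / q_in  # cycle thermal efficiency
```

Each intermediate enthalpy is exactly the kind of value a model must either retrieve correctly (Tier 1) or derive from a component balance (Tier 2) before the cycle-level answer (Tier 3) can come out right, which is why errors compound across tiers.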
The dataset and evaluation code are open-sourced at Hugging Face, allowing researchers and engineers to test their own models or fine-tune specifically for thermodynamic applications.
Why This Matters for Engineering Applications
ThermoQA arrives as AI models are increasingly deployed in technical domains. Just this week, we covered multiple developments in AI coding agents (Claude Code cheat sheet, AWS Bedrock's MCP tools), showing the trend toward specialized technical applications. Thermodynamic reasoning is particularly relevant for:
- Energy system design and optimization
- HVAC and refrigeration engineering
- Mechanical engineering education and assistance
- Technical documentation generation
The benchmark's finding that Claude Opus shows minimal degradation (2.8 pp) while MiniMax shows severe degradation (32.5 pp) suggests that some models have developed more robust reasoning capabilities than others, despite similar performance on simpler tasks.
Agentic.news Analysis
This benchmark arrives at a critical moment in the evolution of LLM evaluation. It follows recent arguments about LLM limitations in scientific discovery, including a Columbia professor's April 21st publication contending that interpolation-based architectures make LLMs fundamentally limited for discovery, and ThermoQA provides concrete evidence of where those limitations manifest. The 32.5 percentage point degradation for MiniMax on complex cycle analysis versus simple lookups starkly illustrates that high performance on retrieval tasks does not translate into multi-step engineering reasoning.

The results align with broader trends we've been tracking. Claude Opus 4.6's strong showing (94.1% overall, minimal degradation) continues its pattern of excellence in technical domains, which we've seen in coding benchmarks (Moonshot AI matches Claude Opus on coding) and agentic applications (Claude Code's 46:1 context cache ratio). The consistency metric (±0.1% sigma for Claude Opus) is particularly noteworthy for engineering applications where reproducible results are essential.
What's missing from this evaluation is how fine-tuned or specialized models would perform. Given the open-source nature of the dataset, we expect to see engineering-focused companies fine-tune smaller models specifically for thermodynamic applications, similar to how coding-specific models have emerged. The 40-60 percentage point spreads on supercritical water and combined-cycle analysis suggest there's substantial room for improvement through domain-specific training.
Frequently Asked Questions
What is ThermoQA?
ThermoQA is a benchmark of 293 open-ended engineering thermodynamics problems designed to evaluate the reasoning capabilities of large language models. It tests three tiers of difficulty: property lookups, component analysis, and full cycle analysis for systems using water, R-134a refrigerant, and air.
Which AI model performs best on thermodynamic reasoning?
According to the ThermoQA benchmark, Claude Opus 4.6 achieves the highest overall accuracy at 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%. Claude Opus also shows the smallest performance degradation between simple and complex problems (2.8 percentage points).
Why does model performance drop on complex thermodynamics problems?
The benchmark reveals that models can memorize thermodynamic properties without understanding how to apply them in multi-step engineering analysis. This "reasoning gap" causes performance drops of up to 32.5 percentage points for some models when moving from simple property lookups to full cycle analysis.
How can I test my own model on ThermoQA?
The dataset and evaluation code are open-sourced on Hugging Face at https://huggingface.co/datasets/olivenet/thermoqa. Researchers and engineers can use this to evaluate their own models or fine-tune specifically for thermodynamic applications.