A new benchmark specifically designed to test thermodynamic reasoning in large language models reveals significant performance degradation as problems move from simple property lookups to complex engineering analysis. ThermoQA, introduced in a March 2026 arXiv paper, evaluates six frontier models across 293 open-ended thermodynamics problems, with Claude Opus 4.6 leading the composite leaderboard at 94.1% accuracy.
Key Takeaways
- Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning.
- Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.
What ThermoQA Measures
ThermoQA consists of three distinct tiers designed to isolate different levels of reasoning:
- Tier 1: Property Lookups (110 questions) – Basic retrieval of thermodynamic properties (temperature, pressure, enthalpy) for given substances and states.
- Tier 2: Component Analysis (101 questions) – Analysis of individual thermodynamic components like compressors, turbines, and heat exchangers.
- Tier 3: Full Cycle Analysis (82 questions) – Complete analysis of thermodynamic cycles including combined-cycle gas turbines and refrigeration systems.
Ground truth answers are computed programmatically using CoolProp 7.2.0, covering water, R-134a refrigerant, and variable-specific-heat air. The benchmark focuses on engineering thermodynamics rather than theoretical physics, making it relevant for practical applications in energy systems, HVAC design, and mechanical engineering.
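The paper summary above does not show the benchmark's actual generation scripts, but a Tier-1-style lookup against CoolProp can be sketched with its real `PropsSI` API (guarded so the snippet degrades gracefully when CoolProp is not installed; the state point chosen here is illustrative):

```python
# Sketch of a Tier-1-style property lookup via CoolProp's PropsSI.
# The specific question and state point are made-up examples, not
# taken from the ThermoQA dataset itself.
try:
    from CoolProp.CoolProp import PropsSI
    # Specific enthalpy of superheated steam at 500 K and 1 atm, in J/kg.
    h = PropsSI("H", "T", 500.0, "P", 101325.0, "Water")
except ImportError:
    h = None  # CoolProp not available in this environment
```

Because the reference value is computed rather than hand-labeled, the same call can regenerate ground truth for any substance and state CoolProp supports, including the supercritical water states the benchmark covers.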
Key Results: Performance Degradation Reveals Reasoning Limits
The most striking finding is the significant performance drop as problems increase in complexity. While all models performed well on simple property lookups, their ability to handle multi-step thermodynamic reasoning varied dramatically.

Composite Leaderboard (Overall and Per-Tier Accuracy):

| Model | Overall | Tier 1 | Tier 2 | Tier 3 |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 94.1% | 96.4% | 93.1% | 93.6% |
| GPT-5.4 | 93.1% | 95.8% | 92.3% | 91.2% |
| Gemini 3.1 Pro | 92.5% | 95.1% | 91.8% | 90.6% |
| GPT-4o | 88.7% | 92.3% | 87.5% | 86.3% |
| Claude Sonnet 4.6 | 85.2% | 90.1% | 83.4% | 82.1% |
| MiniMax | 72.3% | 89.5% | 68.2% | 57.0% |

Cross-Tier Degradation (Percentage Point Drop from Tier 1 to Tier 3):
- Claude Opus 4.6: 2.8 pp (minimal degradation)
- GPT-5.4: 4.6 pp
- Gemini 3.1 Pro: 4.5 pp
- GPT-4o: 6.0 pp
- Claude Sonnet 4.6: 8.0 pp
- MiniMax: 32.5 pp (severe degradation)
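The degradation figures follow directly from the Tier 1 and Tier 3 columns of the leaderboard; a quick arithmetic check:

```python
# Reproduce the cross-tier degradation (percentage points) from the
# reported Tier 1 and Tier 3 accuracies in the leaderboard above.
tiers = {
    "Claude Opus 4.6": (96.4, 93.6),
    "GPT-5.4": (95.8, 91.2),
    "Gemini 3.1 Pro": (95.1, 90.6),
    "GPT-4o": (92.3, 86.3),
    "Claude Sonnet 4.6": (90.1, 82.1),
    "MiniMax": (89.5, 57.0),
}
degradation = {model: round(t1 - t3, 1) for model, (t1, t3) in tiers.items()}
```

Running this reproduces the list above, e.g. 2.8 pp for Claude Opus 4.6 and 32.5 pp for MiniMax.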
The researchers note that "property memorization does not imply thermodynamic reasoning"—a model can excel at looking up values without understanding how to apply them in engineering contexts.
Natural Discriminators and Consistency Metrics
Certain problem types revealed particularly large performance spreads:
- Supercritical water analysis: 40-60 percentage point performance spreads between models
- R-134a refrigerant cycle analysis: Similar 40-60 pp spreads
- Combined-cycle gas turbine analysis: Major discriminators with consistent large gaps
The benchmark also introduces a novel evaluation axis: reasoning consistency. By running each model three times on the same problems, the researchers measured multi-run sigma (standard deviation) ranging from ±0.1% (Claude Opus) to ±2.5% (MiniMax). This quantifies how consistently a model arrives at correct answers, separate from raw accuracy.
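The sigma metric is a plain sample standard deviation over repeated runs. A minimal sketch with hypothetical per-run scores (the paper reports only the resulting sigma, not the per-run numbers):

```python
import statistics

# Hypothetical three-run accuracy scores (%) for one model; the actual
# per-run numbers are not published, only the resulting sigma.
runs = [94.0, 94.1, 94.2]

# Sample standard deviation across the three runs -> the "multi-run sigma".
sigma = statistics.stdev(runs)
```

For these illustrative runs, sigma comes out to 0.1 percentage points, the consistency level reported for Claude Opus; a model that flip-flops between runs would show a correspondingly larger sigma at the same mean accuracy.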
How ThermoQA Works: Programmatic Ground Truth
Unlike benchmarks that rely on human-labeled answers, ThermoQA generates ground truth programmatically using CoolProp, an open-source thermodynamic property database. This eliminates human error and ensures precise numerical accuracy for engineering applications. The benchmark includes:

- 293 open-ended questions requiring numerical answers with units
- Three working fluids: water (including supercritical states), R-134a refrigerant, and air with variable specific heat
- Real engineering scenarios: Rankine cycles, refrigeration cycles, Brayton cycles, and combined cycles
- Progressive difficulty: From single-property lookup to multi-component system analysis
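A Tier-3-style question reduces to chaining component energy balances. As an illustration, the thermal efficiency of an ideal Rankine cycle from four state enthalpies (the values below are made-up examples, not drawn from the benchmark):

```python
# Illustrative full-cycle calculation: ideal Rankine cycle thermal
# efficiency from four state enthalpies (kJ/kg). Values are invented
# for demonstration only.
h1 = 191.8    # saturated liquid leaving the condenser
h2 = 195.0    # compressed liquid after the pump
h3 = 3399.5   # superheated steam leaving the boiler
h4 = 2200.0   # steam after isentropic expansion in the turbine

w_turbine = h3 - h4          # specific turbine work
w_pump = h2 - h1             # specific pump work input
q_in = h3 - h2               # heat added in the boiler
eta = (w_turbine - w_pump) / q_in  # cycle thermal efficiency
```

Each intermediate enthalpy is exactly the kind of value a model must either retrieve correctly (Tier 1) or derive from a component balance (Tier 2) before the cycle-level answer (Tier 3) can come out right, which is why errors compound across tiers.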
The dataset and evaluation code are open-sourced at Hugging Face, allowing researchers and engineers to test their own models or fine-tune specifically for thermodynamic applications.
Why This Matters for Engineering Applications
ThermoQA arrives as AI models are increasingly deployed in technical domains. Just this week, we covered multiple developments in AI coding agents (Claude Code cheat sheet, AWS Bedrock's MCP tools), showing the trend toward specialized technical applications. Thermodynamic reasoning is particularly relevant for:
- Energy system design and optimization
- HVAC and refrigeration engineering
- Mechanical engineering education and assistance
- Technical documentation generation
The benchmark's finding that Claude Opus shows minimal degradation (2.8 pp) while MiniMax shows severe degradation (32.5 pp) suggests that some models have developed more robust reasoning capabilities than others, despite similar performance on simpler tasks.
Agentic.news Analysis
This benchmark arrives at a critical moment in the evolution of LLM evaluation. It follows recent arguments about LLM limitations in scientific discovery, including a Columbia professor's April 21st publication contending that interpolation-based architectures make LLMs fundamentally limited for discovery, and ThermoQA provides concrete evidence of where those limitations manifest. The 32.5 percentage point degradation for MiniMax on complex cycle analysis versus simple lookups starkly illustrates that high performance on retrieval tasks does not translate into multi-step engineering reasoning.

The results align with broader trends we've been tracking. Claude Opus 4.6's strong showing (94.1% overall, minimal degradation) continues its pattern of excellence in technical domains, which we've seen in coding benchmarks (Moonshot AI matches Claude Opus on coding) and agentic applications (Claude Code's 46:1 context cache ratio). The consistency metric (±0.1% sigma for Claude Opus) is particularly noteworthy for engineering applications where reproducible results are essential.
What's missing from this evaluation is how fine-tuned or specialized models would perform. Given the open-source nature of the dataset, we expect to see engineering-focused companies fine-tune smaller models specifically for thermodynamic applications, similar to how coding-specific models have emerged. The 40-60 percentage point spreads on supercritical water and combined-cycle analysis suggest there's substantial room for improvement through domain-specific training.
Frequently Asked Questions
What is ThermoQA?
ThermoQA is a benchmark of 293 open-ended engineering thermodynamics problems designed to evaluate the reasoning capabilities of large language models. It tests three tiers of difficulty: property lookups, component analysis, and full cycle analysis for systems using water, R-134a refrigerant, and air.
Which AI model performs best on thermodynamic reasoning?
According to the ThermoQA benchmark, Claude Opus 4.6 achieves the highest overall accuracy at 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%. Claude Opus also shows the smallest performance degradation between simple and complex problems (2.8 percentage points).
Why does model performance drop on complex thermodynamics problems?
The benchmark reveals that models can memorize thermodynamic properties without understanding how to apply them in multi-step engineering analysis. This "reasoning gap" causes performance drops of up to 32.5 percentage points for some models when moving from simple property lookups to full cycle analysis.
How can I test my own model on ThermoQA?
The dataset and evaluation code are open-sourced on Hugging Face at https://huggingface.co/datasets/olivenet/thermoqa. Researchers and engineers can use this to evaluate their own models or fine-tune specifically for thermodynamic applications.