Semantic Invariance Study Finds Qwen3-30B-A3B Most Robust LLM Agent, Outperforming Larger Models
As large language models (LLMs) are increasingly deployed as autonomous reasoning agents in scientific, business, and decision-support systems, a critical reliability question emerges: does their reasoning remain stable when the same problem is presented with different wording? A new research paper titled "Semantic Invariance in Agentic AI" introduces a systematic testing framework to answer this question, with surprising results that challenge conventional wisdom about model scaling.
The study, published on arXiv, argues that standard benchmark evaluations—which test accuracy on fixed, canonical problem formulations—fail to capture this critical dimension of reliability. The researchers term the desired property semantic invariance: an agent's ability to produce consistent, equivalent reasoning outputs when given semantically equivalent but syntactically varied inputs.
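To make the property concrete, semantic invariance can be stated roughly as follows (the notation here is ours, not the paper's): for an agent f, a problem p, and a set T of semantics-preserving transformations,

$$
\operatorname{sim}\big(f(p),\, f(t(p))\big) \ge \tau \quad \text{for all } t \in T,
$$

where sim is some measure of semantic equivalence between responses and τ is a tolerance close to 1. Averaging the corresponding pass/fail indicator over problems and transformations yields the kind of "invariant responses" percentage reported later in this article.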
What the Researchers Built: A Metamorphic Testing Framework
The core contribution is a metamorphic testing framework designed to systematically assess the robustness of LLM reasoning agents. Instead of evaluating final answer correctness alone, the framework measures whether an agent's reasoning process remains stable under semantic-preserving transformations.
The researchers defined eight specific transformation types:
- Identity: The original, canonical problem formulation (baseline)
- Paraphrase: Restating the problem with different wording
- Fact Reordering: Changing the order of presented facts or constraints
- Expansion: Adding explanatory context that doesn't change the problem
- Contraction: Removing non-essential details
- Academic Context: Framing the problem within academic literature
- Business Context: Framing the problem within business requirements
- Contrastive Formulation: Presenting the problem as a contrast between alternatives
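As a rough sketch of how these transformation types could be encoded in a testing harness (the enum values mirror the list above, but the prompt templates, function names, and LLM interface are illustrative assumptions rather than the paper's actual code):

```python
from enum import Enum

class Transformation(Enum):
    """The eight semantics-preserving transformation types described above."""
    IDENTITY = "identity"
    PARAPHRASE = "paraphrase"
    FACT_REORDERING = "fact_reordering"
    EXPANSION = "expansion"
    CONTRACTION = "contraction"
    ACADEMIC_CONTEXT = "academic_context"
    BUSINESS_CONTEXT = "business_context"
    CONTRASTIVE = "contrastive_formulation"

# Illustrative prompt templates for generating each variant from a canonical
# problem statement; the paper's variants were manually crafted, so a
# prompt-based rewriter is only an approximation of that process.
REWRITE_PROMPTS = {
    Transformation.PARAPHRASE: "Restate the following problem in different words without changing its meaning:\n{problem}",
    Transformation.FACT_REORDERING: "Rewrite the following problem with its facts and constraints in a different order:\n{problem}",
    Transformation.EXPANSION: "Rewrite the following problem, adding explanatory context that does not change what is asked:\n{problem}",
    Transformation.CONTRACTION: "Rewrite the following problem, removing non-essential details while keeping every constraint:\n{problem}",
    Transformation.ACADEMIC_CONTEXT: "Rewrite the following problem as it might appear in an academic paper or textbook:\n{problem}",
    Transformation.BUSINESS_CONTEXT: "Rewrite the following problem as a business requirement or client request:\n{problem}",
    Transformation.CONTRASTIVE: "Rewrite the following problem as a choice between explicitly contrasted alternatives:\n{problem}",
}

def make_variant(problem: str, transform: Transformation, rewrite_llm) -> str:
    """Produce one semantically equivalent variant of a canonical problem.

    `rewrite_llm` is any callable mapping a prompt string to a completion
    string (a hypothetical interface, not tied to a specific provider).
    """
    if transform is Transformation.IDENTITY:
        return problem
    return rewrite_llm(REWRITE_PROMPTS[transform].format(problem=problem))
```

Because the paper's transformations were hand-written, any automated rewriter like this would itself need review to guarantee the variants really are semantically equivalent.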
Key Results: Scale Doesn't Predict Robustness
The evaluation tested seven foundation models across four architectural families on 19 multi-step reasoning problems spanning eight scientific domains (including physics, chemistry, biology, and engineering).

The most striking finding: model scale does not predict robustness. In fact, the relationship appears inverse for some model families.
| Model | Parameters | Invariant Responses | Semantic Similarity |
|---|---|---|---|
| Qwen3-30B-A3B | 30B | 79.6% | 0.91 |
| Qwen3-235B-A22B | 235B | 72.1% | 0.87 |
| Hermes-70B | 70B | 68.3% | 0.84 |
| Hermes-405B | 405B | 65.8% | 0.82 |
| DeepSeek-R1 | Not specified | 71.4% | 0.86 |
| gpt-oss-20B | 20B | 66.7% | 0.83 |
| gpt-oss-120B | 120B | 64.2% | 0.81 |

Note: Semantic similarity is measured on a 0-1 scale, where 1.0 indicates perfect semantic equivalence between responses to transformed and original inputs.
The smaller Qwen3-30B-A3B achieved the highest stability (79.6% invariant responses with semantic similarity 0.91), while larger models in the same family and across other families exhibited greater fragility. The Hermes-405B model, despite having over 13 times more parameters than the top-performing Qwen3 model, showed significantly lower robustness (65.8% invariant responses).
How It Works: Measuring Reasoning Consistency
The testing methodology involves three key components:

Problem Selection: 19 multi-step problems requiring chain-of-thought reasoning, drawn from scientific domains where precision is critical.
Transformation Application: Each problem undergoes all eight semantic-preserving transformations, creating eight semantically equivalent variants.
Response Analysis: For each variant, the researchers collect the LLM's reasoning trace (chain-of-thought) and final answer, then compare them with the response to the canonical formulation using:
- Binary invariance classification: Human evaluation of whether the reasoning process is semantically equivalent
- Semantic similarity scoring: Automated measurement using embedding-based similarity metrics
From these per-variant comparisons, the framework computes two aggregate metrics: the percentage of invariant responses across all transformations and problems, and the average semantic similarity score.
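A minimal sketch of the automated comparison step might look like the following; the embedding model, the cosine-similarity choice, and the 0.85 threshold are assumptions standing in for the paper's unspecified similarity metric and its human binary-invariance judgment:

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model would do; the paper does not name the one it used.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(response_a: str, response_b: str) -> float:
    """Cosine similarity between two reasoning traces (close to 0-1 for natural text)."""
    emb = embedder.encode([response_a, response_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def invariance_report(results, threshold: float = 0.85):
    """Aggregate metrics over (canonical_response, variant_response) pairs.

    `results` is a list of string pairs; `threshold` is an illustrative
    stand-in for the paper's human binary-invariance judgment.
    """
    sims = [semantic_similarity(canon, variant) for canon, variant in results]
    invariant_rate = sum(s >= threshold for s in sims) / len(sims)
    return {
        "invariant_responses_pct": 100.0 * invariant_rate,
        "mean_semantic_similarity": sum(sims) / len(sims),
    }
```

Run over every problem-transformation pair, this yields exactly the two aggregate numbers reported in the results table above: an invariance percentage and a mean similarity score.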
The framework is implemented as a modular Python system that can be extended with additional transformations and evaluation domains. The paper includes specific examples showing how identical chemical engineering problems, when framed in academic versus business contexts, elicited dramatically different reasoning paths from the same model.
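The paper does not publish that code, but one plausible shape for the extension point is a simple registry mapping transformation names to rewrite functions; everything below (names, decorator, example transformation) is illustrative:

```python
from typing import Callable, Dict

# Hypothetical extension point (names are ours): new semantics-preserving
# transformations are registered by name, so additional variants or new
# evaluation domains can be plugged in without touching the core harness.
TRANSFORM_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register_transform(name: str):
    """Decorator that adds a rewrite function to the registry under `name`."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TRANSFORM_REGISTRY[name] = fn
        return fn
    return decorator

@register_transform("regulatory_context")
def regulatory_context(problem: str) -> str:
    """Example user-added framing; not one of the paper's eight transformations."""
    return "A compliance reviewer needs the following question resolved:\n" + problem
```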
Why It Matters: Reliability Gaps in Agentic AI Deployment
This research highlights a significant reliability gap in current LLM agent evaluations. An agent that scores 90% on a standard benchmark might produce inconsistent reasoning when the same problem arrives through different channels (e.g., a user's informal paraphrase versus a formal business document).

For consequential applications like scientific problem-solving, medical diagnosis support, or financial decision-making, this instability represents a serious deployment risk. The findings suggest that:
- Current benchmarks are insufficient for assessing real-world reliability of LLM agents
- Larger models aren't necessarily more robust to input variations, challenging the scaling hypothesis for reliability
- Architectural and training differences may play a more significant role in robustness than parameter count alone
The paper notes that the Qwen3 family's relative robustness might stem from its training methodology or architectural choices, though the researchers don't speculate on specific causes. This opens important research directions for understanding what training techniques promote semantic invariance.
Related Work: Agentic AI in Specialized Domains
The arXiv listing includes a related paper (arXiv:2603.12813) exploring agentic AI applications in chemical process flowsheet modeling—a domain where reasoning consistency is particularly critical. That work demonstrates how multi-agent systems combining GitHub Copilot with state-of-the-art LLMs like Claude Opus 4.6 can generate valid syntax for industrial simulation tools. The connection between these papers highlights the growing intersection of agentic AI and specialized technical domains where semantic invariance becomes operationally essential.
Limitations and Future Work
The study acknowledges several limitations:
- The evaluation covers only 19 problems across scientific domains
- Transformations are manually crafted rather than automatically generated
- The framework doesn't test compositional robustness (combinations of transformations)
- Only reasoning consistency is measured, not correctness of final answers
Future work could expand to more domains, develop automated transformation generators, and investigate the relationship between training data diversity and semantic invariance.
Bottom Line: Before deploying LLM agents in production systems, teams should test for semantic invariance—and not assume larger models will be more robust. The Qwen3-30B-A3B's performance suggests that with the right architecture and training, smaller models can achieve superior reasoning stability.
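One way a team could operationalize that recommendation is a parametrized regression test over the transformation set. The sketch below assumes the hypothetical helpers from the earlier snippets are collected in a local `invariance` module, that `my_agent` exposes the system under test, and that the 0.85 threshold is purely illustrative:

```python
import pytest

# Hypothetical modules: `invariance` collects the helpers sketched earlier in
# this article, and `my_agent` exposes the agent under test.
from invariance import Transformation, make_variant, semantic_similarity
from my_agent import run_agent, rewrite_llm

CANONICAL_PROBLEM = (
    "A reactor converts 80% of feed A to product B at steady state. "
    "Given a feed of 100 mol/h of A, what is the outlet flow of B?"
)

@pytest.mark.parametrize("transform", list(Transformation))
def test_reasoning_is_semantically_invariant(transform):
    """Fail the build if rewording the same problem makes the agent's answer drift."""
    variant = make_variant(CANONICAL_PROBLEM, transform, rewrite_llm)
    baseline = run_agent(CANONICAL_PROBLEM)
    response = run_agent(variant)
    # 0.85 is an illustrative threshold, not a value taken from the paper.
    assert semantic_similarity(baseline, response) >= 0.85
```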