Brittlebench Framework Quantifies LLM Robustness, Finds Semantics-Preserving Perturbations Degrade Performance Up to 12%

Researchers introduce Brittlebench, a framework to measure LLM sensitivity to prompt variations. Applying semantics-preserving perturbations to standard benchmarks degrades model performance by up to 12% and alters model rankings in 63% of cases.

A new research paper introduces Brittlebench, a theoretical framework and evaluation pipeline designed to quantify the brittleness of large language models—their sensitivity to semantics-preserving variations in prompts. The work, published on arXiv, argues that current static benchmarks overestimate real-world performance by failing to account for the noise and variability in human-generated queries.

What the Researchers Built

The core contribution is a framework that disentangles two sources of variance in LLM evaluations: data-induced difficulty (the inherent challenge of a task) and prompt-related variability (how a model's performance fluctuates based on how a question is phrased, even when the meaning is unchanged). The researchers define this sensitivity as brittleness.

To measure it, they built the Brittlebench pipeline. It applies controlled, semantics-preserving perturbations to established benchmark datasets. These perturbations mimic real-world imperfections like:

  • Minor typos or grammatical errors
  • Alternative phrasings of the same question
  • Changes in punctuation or formatting
  • Synonym substitution

The pipeline then measures the variance in model performance across these perturbed versions of the same underlying task.
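As a rough illustration of what such a perturbation stage might look like, here is a minimal sketch in Python. This is not the paper's implementation; the synonym table and the specific operations are illustrative stand-ins for the rule-based perturbations described above.

```python
import random

# Illustrative semantics-preserving perturbations (not Brittlebench's
# actual code): each function alters surface form, not meaning.
SYNONYMS = {"quick": "fast", "big": "large", "choose": "pick"}

def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters to mimic a minor typo."""
    i = rng.randrange(max(len(text) - 1, 1))
    return text[:i] + text[i + 1:i + 2] + text[i:i + 1] + text[i + 2:]

def substitute_synonyms(text: str) -> str:
    """Replace words with meaning-preserving synonyms."""
    return " ".join(SYNONYMS.get(w, w) for w in text.split())

def vary_punctuation(text: str) -> str:
    """Drop or add trailing punctuation."""
    return text.rstrip("?.") if text[-1] in "?." else text + "."

def perturb(prompt: str, n_variants: int = 4, seed: int = 0) -> list[str]:
    """Generate several surface-level variants of one benchmark prompt."""
    rng = random.Random(seed)
    ops = [lambda t: add_typo(t, rng), substitute_synonyms, vary_punctuation]
    return [rng.choice(ops)(prompt) for _ in range(n_variants)]
```

Each original prompt yields several variants, and the model is scored on all of them; the spread of those scores is the raw material for the brittleness measurement.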

Key Results

The researchers applied Brittlebench to evaluate both state-of-the-art open-weight and commercial frontier models. The results reveal significant brittleness across the board.

Figure 5: Comparison of chain-of-thought (CoT) and standard prompting for Claude 4.5 across benchmarks and input perturbations.

  • Max performance drop: Model performance degraded by up to 12% when evaluated on perturbed vs. clean prompts.
  • Ranking instability: A single perturbation altered the relative ranking of models in 63% of cases, directly impacting conclusions about which model performs best.
  • Variance attribution: Semantics-preserving input perturbations accounted for up to 50% of the total performance variance for a given model on a given task.

The 12% degradation is not uniform; some models and tasks are far more sensitive than others. The finding that over half of performance variance can be attributed to prompt phrasing—not task difficulty—challenges the stability of current leaderboards.
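Given a clean-benchmark score and scores on each perturbed variant set, the headline numbers reduce to simple summary statistics. A minimal sketch (the function name and the example scores are illustrative, not from the paper):

```python
import statistics

def brittleness_metrics(clean_score: float,
                        perturbed_scores: list[float]) -> dict:
    """Summarize brittleness for one model on one benchmark.

    clean_score: accuracy on the original prompts.
    perturbed_scores: accuracy on each semantics-preserving variant set.
    """
    max_drop = clean_score - min(perturbed_scores)
    spread = statistics.pstdev(perturbed_scores + [clean_score])
    return {"max_drop": max_drop, "std_across_variants": spread}

# e.g. a model scoring 0.85 clean but 0.73-0.82 under perturbation
m = brittleness_metrics(0.85, [0.79, 0.73, 0.82, 0.77])
# m["max_drop"] is about 0.12, i.e. a 12-point drop from the clean score
```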

How It Works

The Brittlebench methodology involves several steps:

Figure 2: (a) Model sensitivity to perturbation intensity: Llama3.1-8B (top) and Qwen3-8B (bottom) accuracy (%).

  1. Benchmark Selection: The pipeline starts with established benchmarks (e.g., MMLU, HellaSwag, GSM8K).
  2. Perturbation Generation: For each prompt in the benchmark, the system generates multiple variants that preserve semantic meaning. This is done through a rule-based and model-assisted approach to ensure the core query is unchanged.
  3. Model Evaluation: Multiple LLMs are evaluated on both the original and perturbed versions of the benchmark.
  4. Variance Decomposition: Using their theoretical framework, the researchers decompose the total variance in scores into components attributable to (a) the model, (b) the task/data point, and (c) the prompt variant. The portion attributed to the prompt variant is the quantified brittleness.

This approach moves beyond single-score benchmarking to produce a robustness profile for each model.
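The variance-decomposition step (step 4) can be sketched with a simplified one-way split: for one model, scores are grouped by task, and within-task spread across prompt variants is separated from between-task spread. This is a toy stand-in for the paper's framework, which also accounts for the model component:

```python
import statistics

def decompose_variance(scores: dict[str, list[float]]) -> dict:
    """Split total score variance for one model into a task component
    and a prompt-variant component (a simplified ANOVA-style sketch).

    scores: maps each task to the scores on its prompt variants.
    """
    all_scores = [s for variants in scores.values() for s in variants]
    grand_mean = statistics.fmean(all_scores)
    total = sum((s - grand_mean) ** 2 for s in all_scores)
    # Within-task spread: variability attributable to prompt phrasing.
    within = sum(
        (s - statistics.fmean(variants)) ** 2
        for variants in scores.values() for s in variants
    )
    between = total - within  # variability attributable to task difficulty
    return {
        "prompt_variant_share": within / total if total else 0.0,
        "task_share": between / total if total else 0.0,
    }
```

The `prompt_variant_share` is the quantity the paper reports can reach up to 50% of total variance: the portion of score fluctuation that has nothing to do with how hard the task is.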

Why It Matters

Brittlebench provides a necessary corrective to the current evaluation paradigm. Static benchmarks, while useful for controlled comparisons, create a false sense of stability. A model that scores 85% on a clean MMLU test might see its effective performance drop into the 70s when faced with the messy, varied inputs of real-world deployment.

Figure 1: The Brittlebench meta-evaluation framework. We select widely used benchmarks and apply semantics-preserving perturbations.

The finding that model rankings flip in 63% of cases with minor perturbations is particularly consequential for both academic research and commercial model selection. It suggests that declaring a "winner" based on a narrow set of clean prompts is statistically fragile.

For practitioners, this work underscores the importance of stress-testing models with varied prompts before deployment. For researchers, it provides a formal framework and tool (Brittlebench) to measure and report robustness alongside accuracy, pushing the field toward models that are not just capable, but also reliable.

AI Analysis

Brittlebench formalizes a critical but often anecdotal problem in LLM evaluation: performance is highly sensitive to prompt phrasing. The key technical insight is the variance decomposition framework, which cleanly separates the effect of the prompt from the effect of the task. This is more nuanced than simple adversarial attack benchmarks, as it focuses on *semantics-preserving* changes: the kind of innocent variability that occurs in normal use.

The result that prompt variants can explain up to half of a model's performance variance is striking. It implies that a significant portion of what we call "model capability" is actually "model stability to rephrasing." This has direct implications for benchmarking: reporting a single accuracy number on a static dataset is insufficient; the confidence interval or variance across plausible prompts should become a standard metric.

For model developers, this research points to a clear training objective: reduce brittleness. Techniques like data augmentation with paraphrases, adversarial training with semantics-preserving perturbations, or consistency-based loss functions could directly target this weakness. Brittlebench provides the necessary tool to measure progress on this front.
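Reporting a confidence interval across prompt variants, rather than a single static-benchmark number, is straightforward to sketch. The snippet below bootstraps an interval over per-variant accuracies; it is an illustrative approach, not a procedure from the paper:

```python
import random
import statistics

def accuracy_ci(variant_scores: list[float], n_boot: int = 2000,
                alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Bootstrap a (1 - alpha) confidence interval for mean accuracy
    across prompt variants, instead of one static-benchmark number."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(variant_scores, k=len(variant_scores)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g. scores of one model on five perturbed variant sets
low, high = accuracy_ci([0.79, 0.73, 0.82, 0.77, 0.85])
```

A leaderboard entry would then read "0.79 [0.75, 0.83]" rather than "0.85", making ranking flips under rephrasing visible rather than hidden.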
Original source: arxiv.org
