
Ethan Mollick Criticizes GDPval-AA Benchmark as 'Not Good'

AI researcher Ethan Mollick criticized the GDPval-AA benchmark, stating that using Gemini 3.1 to judge other models on public GDPval questions 'tells us nothing.' He called for it to stop being reported.

Gala Smith & AI Research Desk · 5h ago · 4 min read
Ethan Mollick Calls Out GDPval-AA Benchmark as Uninformative

In a recent public critique, Ethan Mollick, a professor at the Wharton School and a prominent voice on AI adoption, singled out a specific AI model benchmark for being uninformative. While praising the benchmarking platform Artificial Analysis for its overall transparency, Mollick stated that the GDPval-AA benchmark is "not a good benchmark and needs to stop being reported."

What Happened

On X (formerly Twitter), Mollick explained the core issue with the GDPval-AA benchmark. The benchmark uses Google's Gemini 3.1 model to judge the outputs of other AI models on questions from the public GDPval dataset. According to Mollick, this methodology "tells us nothing" meaningful about the actual performance or capabilities of the models being evaluated.

The critique centers on a fundamental problem in AI evaluation: using one model (an "AI judge") to score another. If the judge model has specific biases, limitations, or simply a different interpretation of quality, its scores may reflect the judge's characteristics more than the capabilities of the model being tested. When the test questions are also public, the benchmark becomes even less reliable for measuring genuine reasoning or knowledge.
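To make the circularity concrete, here is a minimal Python sketch of the LLM-as-judge pattern being criticized. The `call_model` stub, the prompt wording, and the 1-10 scale are illustrative assumptions, not the actual GDPval-AA implementation.

```python
# Minimal sketch of the LLM-as-judge pattern under discussion.
# call_model() is a placeholder for any chat-completion API; the
# prompt format and 1-10 scale are assumptions for illustration.

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real inference call; returns a canned reply so the sketch runs."""
    return "7"

JUDGE_MODEL = "judge-model"  # GDPval-AA reportedly slots Gemini 3.1 in here

def judge_score(question: str, candidate_answer: str) -> int:
    """Ask the judge model to grade another model's answer from 1 to 10.

    Whatever comes back encodes the judge's own preferences (verbosity,
    phrasing, knowledge cutoff) as much as the answer's actual quality,
    which is exactly the circularity Mollick objects to.
    """
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate answer:\n{candidate_answer}\n\n"
        "Rate the answer from 1 (poor) to 10 (excellent). "
        "Reply with only the integer."
    )
    return int(call_model(JUDGE_MODEL, prompt).strip())

print(judge_score("What is 2 + 2?", "4"))  # prints the judge's opinion, not ground truth
```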

Context: The Benchmarking Landscape

Mollick's call highlights the ongoing challenges and debates within the AI community over how to properly evaluate large language models (LLMs). As models become more capable, creating benchmarks that are robust and unbiased, and that truly measure intelligence or usefulness, is increasingly difficult.

Platforms like Artificial Analysis aggregate scores from multiple public benchmarks (such as MMLU, GPQA, and HumanEval) to provide a composite view of model performance. GDPval-AA appears to be one component of this aggregation, and Mollick's criticism suggests that including this particular metric may distort the overall picture presented to developers and researchers who rely on these scores for decision-making.
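A toy example (with invented numbers) shows how a single noisy component can distort such a composite. Here the last column plays the role of a judge-based metric like GDPval-AA:

```python
# Toy illustration of composite-score distortion. All numbers are
# invented; the point is the mechanism, not any real leaderboard.

scores = {
    #           MMLU   GPQA  HumanEval  judge-based
    "model_a": [88.0, 62.0, 85.0, 55.0],
    "model_b": [86.0, 60.0, 83.0, 78.0],  # the judge happens to favor this one
}

def composite(vals, drop_last=False):
    """Unweighted mean, optionally excluding the judge-based column."""
    vals = vals[:-1] if drop_last else vals
    return sum(vals) / len(vals)

for name, vals in scores.items():
    print(f"{name}: {composite(vals):.1f} with the judge metric, "
          f"{composite(vals, drop_last=True):.1f} without")
# model_a: 72.5 with, 78.3 without; model_b: 76.8 with, 76.3 without.
# The judge-based column alone decides which model ranks first.
```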

The Implication for Practitioners

For AI engineers and technical leaders, benchmark scores are a key data point for model selection, whether for prototyping, production use, or research comparison. A flawed benchmark can lead to suboptimal choices or misallocated resources. Mollick's public call to stop reporting GDPval-AA is a direct recommendation to the community to refine its evaluation criteria and focus on metrics that provide actionable, trustworthy signals.

gentic.news Analysis

Mollick's critique taps into a critical and persistent tension in AI development: the benchmark arms race versus real-world utility. As we've covered extensively, including in our analysis of the Vibe-Eval benchmark's attempt to measure "vibes" and the controversies surrounding data contamination in training sets, the community is struggling to build evaluation methodologies that keep pace with model capabilities. When a benchmark like GDPval-AA relies on another LLM as a judge on public data, it risks measuring benchmark-specific optimization rather than generalizable skill.

This aligns with a broader trend of increased scrutiny on evaluation integrity. In recent months, we've seen similar discussions around the SWE-bench coding benchmark and its "verified" versus "unverified" scores, where the method of answer verification drastically changes the leaderboard. Mollick, whose work focuses on how professionals actually integrate AI into workflows, is effectively arguing for benchmarks that correlate with practical performance, not just circular, model-on-model grading.

The division of roles here is also notable. Artificial Analysis (the benchmarking platform) aggregates data, while Google's Gemini is the judge model in question. Mollick's criticism isn't of Gemini's capabilities per se, but of its use as an evaluation tool in this specific, flawed configuration. This serves as a reminder that even outputs from top-tier models should not be uncritically treated as ground truth, especially when they form the basis of competitive scoring.

Frequently Asked Questions

What is the GDPval-AA benchmark?

GDPval-AA is a benchmark score reported by the platform Artificial Analysis. It uses Google's Gemini 3.1 model to evaluate the responses of other AI models to questions from the public GDPval dataset.

Why does Ethan Mollick say GDPval-AA is a bad benchmark?

Mollick argues it "tells us nothing" because it uses one AI model (Gemini 3.1) to judge others on public questions. This creates a circular evaluation where the scores may reflect the judge model's biases and the benchmark's susceptibility to contamination, rather than measuring true model capability or reasoning.

What is Artificial Analysis?

Artificial Analysis is a platform that aggregates performance scores from multiple public AI benchmarks to provide composite rankings and comparisons of large language models (LLMs).

How should developers evaluate AI models if some benchmarks are flawed?

Developers should rely on a suite of benchmarks, prioritize those with robust, human-verified evaluation methods, and—critically—conduct their own task-specific evaluations on private datasets that reflect their actual use cases. No single public benchmark score should be the sole criterion for model selection.
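As a sketch of that last point, here is a minimal task-specific eval against a private dataset. The `generate` stub and the substring grader are assumptions for illustration; a real suite would use task-appropriate checks (unit tests, human rubric review, and so on).

```python
# Minimal private-eval harness: held-out cases drawn from your actual
# workload, scored without any LLM judge in the loop.

PRIVATE_EVAL = [
    {"prompt": "Extract the total from: 'Total due: $412.50'",
     "expected": "412.50"},
    {"prompt": "Extract the total from: 'Amount payable: USD 99.00'",
     "expected": "99.00"},
]

def generate(prompt: str) -> str:
    """Stand-in for a real model call; canned so the sketch runs end to end."""
    return "The total is 412.50."

def run_eval(cases) -> float:
    """Score by substring match; real suites should use stricter graders."""
    passed = sum(case["expected"] in generate(case["prompt"]) for case in cases)
    return passed / len(cases)

print(f"pass rate: {run_eval(PRIVATE_EVAL):.0%}")  # 50% on this toy data
```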

AI Analysis

Mollick's critique is less about a single flawed metric and more a symptom of a systemic issue in LLM evaluation: the search for scalable, automated assessment is creating benchmarks that are easy to run but poor proxies for real capability. The GDPval-AA benchmark exemplifies the "LLM-as-judge" paradigm, which is computationally cheap but philosophically fraught. The judge LLM imposes its own latent understanding of correctness, style, and reasoning onto the eval. If the judge model has a particular weakness or bias (e.g., toward verbose answers, specific phrasing, or knowledge cut-off artifacts), it systematically penalizes or rewards models accordingly, regardless of objective truth or utility.

This connects directly to our previous reporting on the **LLM Judge Agreement problem**, where different judge models (GPT-4, Claude 3, Gemini) frequently disagree on scoring the same model output. Using Gemini 3.1 as a single judge for a published benchmark introduces a **single point of bias** into the competitive landscape. For practitioners, the takeaway is to treat any benchmark using an LLM judge on public data with extreme skepticism. It may indicate how well a model's output aligns with Gemini's preferences, but not necessarily with correctness or practical value.

The broader trend, as seen with the push for more **human-in-the-loop evaluation** in benchmarks like LiveCodeBench or the rigorous verification in SWE-bench, is a return to ground-truth-based assessment, even if it's more expensive. Mollick's call to simply stop reporting GDPval-AA is a pragmatic one: removing noisy signals cleans up the data environment for everyone. In the long run, the field needs evaluation suites that are adversarial (actively designed to foil surface-level pattern matching) and that prioritize tasks where the answer is verifiably right or wrong, not just stylistically pleasing to another AI.
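One back-of-envelope way to see the judge-agreement problem is to have several judges score the same outputs and measure how often they land close together. The scores below are invented for illustration; in practice each row would come from prompting a different judge model on the same candidate answers.

```python
# Illustrative check for the judge-agreement problem: three judges
# score the same five outputs on a 1-10 scale (numbers invented), and
# we measure how often any two judges land within one point.
from itertools import combinations

judge_scores = {
    "judge_a": [8, 6, 9, 4, 7],
    "judge_b": [8, 7, 5, 4, 7],
    "judge_c": [6, 6, 9, 3, 8],
}

def agreement(a, b, tolerance=1):
    """Fraction of items where two judges differ by at most `tolerance`."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)

for (name_a, a), (name_b, b) in combinations(judge_scores.items(), 2):
    print(f"{name_a} vs {name_b}: {agreement(a, b):.0%}")
# judge_a vs judge_b: 80%, judge_a vs judge_c: 80%, judge_b vs judge_c: 60%.
# When agreement sits this far from 100%, "the" score depends on which
# judge you picked, i.e. a single point of bias.
```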