


VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of the Time

Frontier MLLMs achieve only 26.5% accuracy on VAB, far below the 68.9% of human experts. Fine-tuning on expert comparisons narrows the gap.

Source: arxiv.org, via arxiv_cv (single source)
How well do frontier multimodal models perform on the Visual Aesthetic Benchmark?

Frontier MLLMs achieve only 26.5% accuracy on the Visual Aesthetic Benchmark (VAB), far below the 68.9% of human experts. VAB tests comparative aesthetic judgment across 400 tasks with 10 expert judges per task.

TL;DR

Top MLLM scores 26.5% on VAB vs. 68.9% for humans. · Scalar scores misrepresent comparative aesthetic preferences. · A fine-tuned 35B model rivals a 397B open-weight model.

A new benchmark reveals frontier MLLMs judge aesthetics correctly only 26.5% of the time. The Visual Aesthetic Benchmark (VAB) shows a 42.4-point gap versus human experts.

Key facts

  • Top MLLM scores 26.5% on VAB vs. 68.9% for humans.
  • VAB includes 400 tasks and 1,195 images.
  • Labels from 10 expert judges per task.
  • Fine-tuning a 35B model on 2,000 examples matches a 397B model.
  • 20 frontier MLLMs and 6 reward models evaluated.

Key Takeaways

  • Frontier MLLMs achieve only 26.5% accuracy on VAB, far below the 68.9% of human experts.
  • Fine-tuning on 2,000 expert comparisons narrows the gap.

The Problem with Scalar Scores


Most aesthetic evaluation systems reduce judgment to a single scalar score per image. The VAB authors first tested this approach with eight expert annotators: score-derived rankings aligned poorly with the same annotators' direct comparisons. Direct ranking yielded substantially higher inter-annotator agreement on best- and worst-image labels [According to Visual Aesthetic Benchmark].
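To make the mismatch concrete, here is a minimal Python sketch of the check the authors describe: derive best and worst picks from an annotator's per-image scalar scores, then see how often those picks match the same annotator's direct comparative choices. The data layout (`scalar_scores`, `direct_picks`) is a hypothetical stand-in, not the paper's format.

# Sketch: do best/worst picks derived from scalar scores agree with
# picks made by direct comparison? (Illustrative data layout only.)

def picks_from_scores(scores: list[float]) -> tuple[int, int]:
    """Derive (best, worst) image indices from per-image scalar scores."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return best, worst

def agreement_rate(scalar_scores, direct_picks) -> float:
    """Fraction of tasks where score-derived picks match direct picks."""
    hits = sum(
        picks_from_scores(scores) == picks
        for scores, picks in zip(scalar_scores, direct_picks)
    )
    return hits / len(direct_picks)

# One annotator's scores rank image 2 best, yet their direct comparison
# picked image 0: the kind of disagreement the VAB authors measured.
print(agreement_rate(
    scalar_scores=[[7.1, 6.8, 7.4]],
    direct_picks=[(0, 1)],  # (best_index, worst_index) from direct ranking
))  # -> 0.0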

VAB Design and Results

VAB casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. The benchmark contains 400 tasks and 1,195 images spanning fine art, photography, and illustration. Labels come from the consensus of 10 independent expert judges per task [per the arXiv preprint].
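The article does not spell out the aggregation rule, but a majority vote over the 10 judges' picks is one plausible reading; the sketch below assumes it purely for illustration.

from collections import Counter

# Assumed aggregation (majority vote); the benchmark's actual consensus
# rule may differ.
def consensus(judge_picks: list[tuple[int, int]]) -> tuple[int, int]:
    """Aggregate judges' (best, worst) index picks into a consensus label."""
    best = Counter(b for b, _ in judge_picks).most_common(1)[0][0]
    worst = Counter(w for _, w in judge_picks).most_common(1)[0][0]
    return best, worst

print(consensus([(0, 2)] * 6 + [(1, 2)] * 4))  # -> (0, 2)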

Evaluating 20 frontier MLLMs and six dedicated visual-quality reward models, the strongest system identified both the best and worst image correctly across three random permutations in only 26.5% of tasks. Human experts achieved 68.9% accuracy on the same tasks, a 42.4-point gap.
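That 26.5% reflects a strict success criterion. The sketch below is one reading of it, with `model_pick` as a hypothetical callable standing in for an MLLM query; the authors' released harness may differ in detail. A task counts as solved only if the model's best and worst picks survive all three reshuffles of the candidate order.

import random

def task_correct(images, best, worst, model_pick, trials=3, seed=0):
    """True only if the model nails (best, worst) under every permutation."""
    rng = random.Random(seed)
    for _ in range(trials):
        order = list(range(len(images)))
        rng.shuffle(order)                         # reshuffle candidate order
        shown = [images[i] for i in order]
        pick_best, pick_worst = model_pick(shown)  # positions in shown order
        # Map positional picks back to original image indices before checking.
        if (order[pick_best], order[pick_worst]) != (best, worst):
            return False
    return True

def vab_accuracy(tasks, model_pick) -> float:
    """Share of tasks solved under all permutations (the 26.5% figure)."""
    return sum(
        task_correct(t["images"], t["best"], t["worst"], model_pick)
        for t in tasks
    ) / len(tasks)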

Fine-Tuning Transfer


Fine-tuning a 35B-parameter model on 2,000 expert examples brought its accuracy close to that of a 397B-parameter open-weight model. This suggests the comparative signal in VAB is transferable: smaller, specialized models can approach frontier performance with targeted data [According to Visual Aesthetic Benchmark].
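For scale, 2,000 comparisons is a small supervised set. A hypothetical sketch of how such expert labels could be serialized for instruction tuning follows; the field names, prompt wording, and file name are illustrative, not the paper's actual format.

import json

expert_tasks = [  # toy stand-in for the ~2,000 expert comparisons
    {"images": ["a.jpg", "b.jpg", "c.jpg"], "best": 0, "worst": 2},
]

def to_training_example(task: dict) -> dict:
    """Turn one expert comparison into a supervised (prompt, response) pair."""
    n = len(task["images"])
    return {
        "images": task["images"],  # candidate set with matched subject matter
        "prompt": f"Here are {n} images of the same subject. "
                  "Which is aesthetically best, and which is worst?",
        "response": f"Best: image {task['best'] + 1}. "
                    f"Worst: image {task['worst'] + 1}.",
    }

with open("vab_train.jsonl", "w") as f:
    for task in expert_tasks:
        f.write(json.dumps(to_training_example(task)) + "\n")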

What This Means for Deployment

Multimodal large language models are routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require explicit aesthetic judgment. The VAB results expose a structural weakness in current approaches. Scalar scoring fundamentally misrepresents comparative preference, and even the best models fall far short of expert consensus.

What to watch

Watch for open-weight models fine-tuned on VAB data to approach human-level performance in the next six months. Also track whether commercial API providers (OpenAI, Google, Anthropic) publish VAB scores or adopt comparative evaluation for aesthetic tasks.


Sources cited in this article

  1. Visual Aesthetic Benchmark, arXiv preprint (arxiv.org)

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

The VAB paper exposes a fundamental flaw in how aesthetic judgment is operationalized in current multimodal systems. The scalar-score approach, inherited from image quality assessment, fails to capture comparative preference. This isn't just a benchmark gap — it's a structural mismatch between the evaluation metric and the task. The transferability finding is the most actionable result: 2,000 expert comparisons can boost a 35B model to near-frontier performance. This suggests the bottleneck is not model scale but data quality and task formulation. Expect VAB to become a standard eval for aesthetic tasks, similar to how MMLU became the de facto reasoning benchmark. The 26.5% top score is a wake-up call for any team deploying MLLMs for visual curation or generation.
