DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness

Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.

A new benchmark called DEAF (Diagnostic Evaluation of Acoustic Faithfulness) reveals that Audio Multimodal Large Language Models (Audio MLLMs) predominantly rely on textual cues rather than genuinely processing acoustic signals, despite their strong performance on standard speech benchmarks. Published on arXiv on March 17, 2026, the research introduces a controlled framework to systematically test whether these models are actually "listening" or just performing text-based inference.

The Problem: Benchmarks That Don't Test Listening

Recent Audio MLLMs—models that process both audio and text—have shown impressive results on established speech understanding benchmarks. However, these benchmarks typically present audio and text that are semantically aligned. A model could achieve high scores by primarily processing the transcribed text or metadata, paying minimal attention to the acoustic properties of the audio itself.

The fundamental question the DEAF benchmark addresses is: Do these models genuinely understand the acoustic content of audio, or do they rely on textual shortcuts? This distinction matters for real-world applications where acoustic information is critical—detecting sarcasm through tone, identifying speakers in security contexts, or understanding instructions amid background noise.

What DEAF Measures: Three Acoustic Dimensions

The DEAF benchmark consists of over 2,700 carefully constructed "conflict stimuli" where the acoustic signal contradicts the textual content or prompt. This forces models to choose between trusting what they hear versus what they read. The conflicts span three core acoustic dimensions:

  1. Emotional Prosody: The same sentence (e.g., "This is great news") is spoken in a tone that contradicts its content (e.g., a sad voice delivering good news).
  2. Background Sounds: Audio contains salient background sounds (e.g., barking dogs, rain) that conflict with the spoken narrative.
  3. Speaker Identity: Audio features a speaker whose identity (e.g., gender, age) contradicts textual descriptions or prompts.
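
For concreteness, a conflict item along any of these dimensions can be pictured as a small record that pairs the audio with the contradicting text and the acoustic ground truth. The sketch below is hypothetical; the field names and labels are illustrative rather than the paper's actual schema.

    # Hypothetical schema for one DEAF-style conflict stimulus (not the paper's actual format).
    from dataclasses import dataclass
    from enum import Enum

    class ConflictType(Enum):
        EMOTIONAL_PROSODY = "emotional_prosody"   # happy words spoken in a sad tone
        BACKGROUND_SOUND = "background_sound"     # narrative contradicted by salient sounds
        SPEAKER_IDENTITY = "speaker_identity"     # prompt says child, voice is an adult

    @dataclass
    class ConflictStimulus:
        audio_path: str            # clip the model should actually listen to
        transcript: str            # spoken words (may contradict the acoustic cue)
        prompt: str                # user prompt (may also contradict the acoustic cue)
        conflict_type: ConflictType
        acoustic_truth: str        # label supported by the audio alone
        textual_claim: str         # competing label implied by transcript or prompt

    # Example: "This is great news" read in a sad tone.
    stimulus = ConflictStimulus(
        audio_path="clips/great_news_sad_tone.wav",
        transcript="This is great news",
        prompt="What emotion does the speaker convey?",
        conflict_type=ConflictType.EMOTIONAL_PROSODY,
        acoustic_truth="sad",
        textual_claim="happy",
    )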

The Evaluation Framework: Disentangling Textual Bias

The researchers designed a multi-level evaluation framework that progressively increases textual influence to pinpoint where models fail. This structure helps disentangle two key failure modes:

  • Content-Driven Bias: When the semantic meaning of the spoken words overrides the acoustic signal.
  • Prompt-Induced Sycophancy: When a model's response is overly swayed by a leading or misleading user prompt, ignoring the audio evidence.

The framework tests four progressively challenging conditions:

  1. Acoustic-Only: Minimal textual influence; the prompt is neutral.
  2. Content Conflict: The spoken words semantically contradict the acoustic cue.
  3. Prompt Conflict: The user's text prompt contradicts the acoustic cue.
  4. Combined Conflict: Both the content and the prompt contradict the acoustic cue.
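
A rough way to picture these conditions: for a given acoustic cue (say, a sad tone), each level swaps in either a neutral or a contradicting transcript and prompt. The helper below is a hypothetical sketch of that construction, not the paper's actual prompt wording.

    # Hypothetical construction of the four conditions for an emotional-prosody item.
    def build_condition(textual_claim: str, condition: str):
        """Return a (transcript, prompt) pair for one evaluation condition."""
        neutral_transcript = "The meeting starts at noon."           # semantically neutral words
        conflict_transcript = f"I feel so {textual_claim} today."    # words contradict the tone
        neutral_prompt = "What emotion does the speaker convey?"
        conflict_prompt = f"The speaker sounds {textual_claim}, right? What emotion do you hear?"

        if condition == "acoustic_only":       # 1. neutral words, neutral prompt
            return neutral_transcript, neutral_prompt
        if condition == "content_conflict":    # 2. spoken words contradict the acoustic cue
            return conflict_transcript, neutral_prompt
        if condition == "prompt_conflict":     # 3. user prompt contradicts the acoustic cue
            return neutral_transcript, conflict_prompt
        if condition == "combined_conflict":   # 4. both contradict the acoustic cue
            return conflict_transcript, conflict_prompt
        raise ValueError(f"unknown condition: {condition}")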

Key Results: A Pattern of Text Dominance

The study evaluated seven contemporary Audio MLLMs. The results show a consistent and clear pattern: models are sensitive to acoustic variations, but their final predictions are overwhelmingly driven by textual inputs.

Diagnostic metrics introduced in the paper quantify this reliance. The core metric is the Acoustic Faithfulness Score (AFS), which measures how often a model's response aligns with the acoustic truth when it conflicts with the text. Across the tested models, AFS scores were generally below 50%, and in many conflict scenarios they dropped significantly lower.
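
In spirit, such a score is simply the fraction of conflict items where the model sides with the audio rather than the text. The snippet below is a simplified sketch of that idea; the paper's exact AFS definition and scoring protocol may differ.

    # Simplified faithfulness metric: fraction of conflict items where the model's
    # answer reflects the acoustic truth rather than the textual claim.
    # (The paper's exact AFS definition may differ.)
    def acoustic_faithfulness_score(results):
        """results: list of dicts with 'prediction', 'acoustic_truth', 'textual_claim'."""
        if not results:
            return 0.0
        faithful = sum(
            1 for r in results
            if r["acoustic_truth"].lower() in r["prediction"].lower()
            and r["textual_claim"].lower() not in r["prediction"].lower()
        )
        return faithful / len(results)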

For example, when presented with a sentence spoken in a sad tone but with happy content, most models would describe the emotion as "happy," following the text. Similarly, if a prompt stated the speaker was a child but the audio clearly featured an adult voice, models would often affirm the (incorrect) prompt.

The research concludes there is a "significant gap between high performance on standard speech benchmarks and genuine acoustic understanding."

How DEAF Works: Constructing the Conflict

The technical rigor of DEAF lies in its stimulus construction. To create the 2,700+ items, the researchers:

  • Curated or generated audio samples for the three acoustic dimensions.
  • Paired them with conflicting textual transcripts or prompts. For instance, an audio clip of someone speaking angrily would be paired with a transcript saying "I am very calm."
  • Ensured the conflicts were unambiguous to human listeners through validation studies.

The evaluation protocol feeds both the audio and the (potentially conflicting) text into the model and analyzes its free-form responses to see which modality it favors.
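
Putting the pieces together, the protocol can be pictured as a loop that feeds each stimulus to the model and scores the responses with a faithfulness metric. The sketch below reuses the hypothetical ConflictStimulus record and acoustic_faithfulness_score function from the earlier snippets; model.generate is a placeholder for whichever Audio MLLM API is under test, not a real library call.

    # Sketch of the evaluation loop; `model.generate` is a placeholder, not a real API.
    def evaluate(model, stimuli):
        results = []
        for s in stimuli:
            response = model.generate(
                audio=s.audio_path,
                text=f"{s.prompt}\nTranscript: {s.transcript}",
            )
            results.append({
                "prediction": response,
                "acoustic_truth": s.acoustic_truth,
                "textual_claim": s.textual_claim,
            })
        return acoustic_faithfulness_score(results)   # metric sketched above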

Why This Benchmark Matters

DEAF shifts the evaluation paradigm for Audio MLLMs from "can they answer questions about speech?" to "do they actually process the sound?" This has several implications:

  • Model Development: It provides a diagnostic tool for developers to identify and rectify over-reliance on text in their training pipelines or architectures.
  • Trust and Robustness: For safety-critical applications (e.g., medical diagnosis from voice, security authentication), understanding a model's failure modes is essential. A model that ignores acoustic evidence is unreliable in novel or adversarial situations.
  • Scientific Understanding: It advances the field's understanding of multimodal integration in AI. The results suggest current models are not fusing audio and text in a balanced way but are using text as a primary crutch.

The authors posit that improving performance on DEAF will require architectural innovations, training objectives that penalize text-only shortcuts, and datasets that explicitly contain audio-text conflicts.
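
As a purely illustrative example of that last point, a fine-tuning objective could upweight the loss on examples where audio and text conflict and the target answer is grounded in the audio, so that text-only shortcuts cost the model more. The PyTorch sketch below is an assumption for illustration, not a method described in the paper.

    # Hypothetical conflict-weighted objective (not from the paper): upweight the
    # loss on audio/text-conflict examples whose targets are grounded in the audio.
    import torch
    import torch.nn.functional as F

    def conflict_weighted_loss(logits, acoustic_targets, is_conflict, conflict_weight=2.0):
        """logits: (batch, classes); acoustic_targets: labels derived from the audio;
        is_conflict: bool tensor marking examples where audio and text disagree."""
        per_example = F.cross_entropy(logits, acoustic_targets, reduction="none")
        weights = torch.where(
            is_conflict,
            torch.full_like(per_example, conflict_weight),
            torch.ones_like(per_example),
        )
        return (weights * per_example).mean()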

AI Analysis

The DEAF benchmark represents a crucial methodological advance in multimodal AI evaluation. For years, the field has struggled with the "clever Hans" problem in vision-language models, where models exploit textual correlations rather than learning genuine visual understanding. DEAF formally brings this rigorous, disentangled evaluation paradigm to the audio domain. Its multi-level framework is particularly elegant: it doesn't just identify the problem but helps diagnose its source, whether that is inherent bias in the training data (content conflict) or an alignment/fine-tuning artifact that teaches models to be overly obedient to user prompts (prompt conflict).

Practitioners building or deploying Audio MLLMs should pay close attention. High scores on benchmarks like SpeechQA or AudioCaps may be misleading indicators of a model's true capability in applications where paralinguistic features (tone, emotion, speaker ID) are key. The sub-50% Acoustic Faithfulness Scores indicate that current models are not robust for these tasks. The immediate takeaway is to test any audio model on DEAF-style conflicts before deployment in sensitive scenarios.

This work also implicitly critiques standard training datasets and objectives. Most audio-text datasets are built for semantic alignment (e.g., captioning audio), which may teach models to treat audio as a noisy version of text. Future progress might require creating pre-training data with intentional audio-text mismatches or developing loss functions that specifically reward the model for correctly resolving modality conflicts, pushing it toward true multimodal integration.
Original source: arxiv.org
