CRYSTAL Benchmark Reveals Universal Step-Disorder in MLLMs: No Model Preserves >60% of Reasoning Steps in Correct Order
Researchers have introduced CRYSTAL (Clear Reasoning via Yielded Steps, Traceability and Logic), a diagnostic benchmark designed to evaluate multimodal reasoning through verifiable intermediate steps rather than just final answers. The benchmark contains 6,372 instances and reveals systematic failures in current multimodal large language models (MLLMs) that remain invisible when measuring only accuracy.
What the Researchers Built
CRYSTAL is constructed through a Delphi-inspired pipeline where four independent MLLMs generate reasoning trajectories for each problem. These trajectories are aggregated via semantic clustering and validated through human quality gates to create reference reasoning chains. This approach aims to capture diverse but valid reasoning paths rather than enforcing a single "correct" sequence.
The benchmark introduces two complementary metrics:
- Match F1: Scores step-level precision and recall via semantic similarity matching between model-generated steps and reference steps (sketched after this list)
- Ordered Match F1: Extends Match F1 by further penalizing disordered reasoning chains, requiring matched steps to appear in the reference sequence (see the ordering sketch under How It Works)
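To make the first metric concrete, here is a minimal Python sketch of Match F1 with greedy one-to-one matching. The embed function and the 0.8 similarity threshold are illustrative assumptions; the paper's actual matching procedure may differ.

```python
import numpy as np

def match_f1(gen_steps, ref_steps, embed, threshold=0.8):
    """Greedy one-to-one matching of generated steps to reference steps by
    cosine similarity. embed() and the threshold are illustrative
    assumptions, not the paper's exact implementation."""
    gen_vecs = [embed(s) for s in gen_steps]
    ref_vecs = [embed(s) for s in ref_steps]
    taken = set()    # reference indices already matched
    matches = []     # (generated index, reference index) pairs
    for i, g in enumerate(gen_vecs):
        best_j, best_sim = None, threshold
        for j, r in enumerate(ref_vecs):
            if j in taken:
                continue
            sim = float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r)))
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            taken.add(best_j)
            matches.append((i, best_j))
    precision = len(matches) / max(len(gen_steps), 1)
    recall = len(matches) / max(len(ref_steps), 1)
    if precision + recall == 0:
        return 0.0, matches
    return 2 * precision * recall / (precision + recall), matches
```

The one-to-one constraint matters: it prevents a model from inflating recall by restating the same step several times.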
Key Results
The evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals three systematic failures:

- Step omission: high precision but low recall, meaning models produce valid steps yet skip much of the required chain
- Step disorder: matched steps frequently appear out of logical sequence
- Non-monotonic scaling: larger models do not consistently reason better than smaller ones
Most strikingly, no competitive model preserves more than 60% of matched steps in correct order, indicating that even when models generate correct reasoning components, they frequently arrange them illogically.
How It Works
The benchmark construction pipeline involves several stages:

- Problem Selection: 6,372 multimodal reasoning problems requiring step-by-step solutions
- Trajectory Generation: Four independent MLLMs generate reasoning chains for each problem
- Semantic Clustering: Generated steps are clustered by semantic similarity to identify common reasoning patterns (see the sketch after this list)
- Human Validation: Quality gates ensure reference chains are logically sound and complete
- Metric Calculation: Both Match F1 and Ordered Match F1 are computed against these reference chains
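As an illustration of the clustering stage, the sketch below greedily merges steps whose embeddings exceed a cosine-similarity threshold. The embed function and the 0.85 threshold are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def cluster_steps(steps, embed, threshold=0.85):
    """Greedy single-pass clustering: a step joins the first cluster whose
    centroid is within the cosine-similarity threshold, otherwise it seeds
    a new cluster. embed() and the threshold are illustrative assumptions."""
    clusters = []  # list of (unit-norm centroid, [member steps])
    for step in steps:
        v = embed(step)
        v = v / np.linalg.norm(v)
        for idx, (centroid, members) in enumerate(clusters):
            if float(np.dot(v, centroid)) >= threshold:
                members.append(step)
                # nudge the centroid toward the new member, then renormalize
                c = centroid + (v - centroid) / len(members)
                clusters[idx] = (c / np.linalg.norm(c), members)
                break
        else:
            clusters.append((v, [step]))
    return [members for _, members in clusters]
```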
For evaluation, models generate reasoning trajectories, which are then compared to the reference chains using semantic similarity measures. The Ordered Match F1 metric adds an ordering constraint: matched steps must appear in the same relative order as in the reference chain.
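One plausible realization of this constraint credits only the matched steps whose reference indices form a longest increasing subsequence (LIS) when read in generated order; the paper's exact ordering penalty may differ. The sketch below consumes the (generated index, reference index) pairs produced by the match_f1 sketch above.

```python
import bisect

def ordered_match_f1(matches, n_gen, n_ref):
    """Credit only matched steps whose reference indices form an increasing
    subsequence in generated order (patience-sorting LIS). One plausible
    realization of the ordering constraint, not the paper's exact formula."""
    ref_order = [j for _, j in sorted(matches)]  # reference ids in generated order
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k+1
    for j in ref_order:
        pos = bisect.bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
        else:
            tails[pos] = j
    ordered = len(tails)
    precision = ordered / max(n_gen, 1)
    recall = ordered / max(n_ref, 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Usage follows directly from the earlier sketch: given f1, matches = match_f1(gen, ref, embed), call ordered_match_f1(matches, len(gen), len(ref)); any matched step that breaks the reference order is excluded from the ordered count.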
Beyond Evaluation: Causal Process Reward and Curriculum
The researchers propose two training innovations based on CRYSTAL's findings:

Causal Process Reward (CPR): A multiplicative reward that couples answer correctness with step-level alignment. Unlike additive rewards that treat answer and process as separate components, CPR multiplies these factors, creating a stronger coupling between correct answers and proper reasoning.
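The difference between the two reward shapes is easy to state in code. The sketch below assumes a binary answer-correctness signal and a step-alignment score in [0, 1] (for example, Match F1 against the reference chain); the additive weights are hypothetical.

```python
def additive_reward(answer_correct, step_alignment, alpha=0.5, beta=0.5):
    """Additive baseline: answer and process credit are independent, so a
    wrong answer can still collect process reward. alpha/beta are
    hypothetical weights."""
    return alpha * float(answer_correct) + beta * step_alignment

def causal_process_reward(answer_correct, step_alignment):
    """CPR sketch: multiplicative coupling means step alignment only pays
    off when the final answer is also correct. The paper may add further
    shaping terms."""
    return float(answer_correct) * step_alignment
```

Under the multiplicative form, a trajectory with a wrong answer earns zero reward no matter how plausible its steps look, which removes the incentive to produce well-formed but incorrect reasoning.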
CPR-Curriculum: A training approach that progressively increases reasoning difficulty. The curriculum starts with simpler reasoning tasks and gradually introduces more complex chains, allowing models to learn step-by-step reasoning incrementally.
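A minimal sketch of such a schedule, using reference-chain length as a difficulty proxy (an assumption; the paper may rank difficulty differently):

```python
def curriculum_batches(problems, num_phases=3):
    """Illustrative CPR-Curriculum schedule: rank problems by reference-chain
    length as a difficulty proxy (an assumption) and release them in
    cumulative phases, so harder chains are added while easier ones remain."""
    ranked = sorted(problems, key=lambda p: len(p["reference_chain"]))
    phase_size = max(len(ranked) // num_phases, 1)
    for phase in range(1, num_phases + 1):
        end = len(ranked) if phase == num_phases else phase_size * phase
        yield ranked[:end]
```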
In experiments using GRPO (Group Relative Policy Optimization), CPR-Curriculum achieves a +32% improvement in Match F1 in settings where additive reward strategies fail, demonstrating that the approach can improve reasoning transparency without manual step annotation during training.
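For context, GRPO needs no learned value function: it standardizes rewards within the group of responses sampled for the same prompt. A minimal sketch, with CPR supplying the per-response rewards:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group of responses sampled for one
    prompt: advantage_i = (r_i - mean) / (std + eps). Rewards here would
    come from causal_process_reward for each sampled trajectory."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```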
Why It Matters
Current multimodal reasoning evaluation focuses overwhelmingly on final answer accuracy, which masks fundamental flaws in how models arrive at those answers. CRYSTAL reveals that even state-of-the-art MLLMs exhibit systematic reasoning deficiencies:
- Step omission (high precision, low recall) suggests models cherry-pick reasoning elements rather than constructing complete chains
- Disordered reasoning indicates models lack understanding of logical progression, even when they possess the necessary components
- Non-monotonic scaling challenges the assumption that larger models inherently reason better
The CPR training approach shows promise for addressing these issues without expensive manual annotation, potentially enabling more transparent and reliable multimodal reasoning systems.