HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding
HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.




























