Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

2025-03-12

Qiji Zhou, Yifan Gong, Guangsheng Bao, Hongjie Qiu, Jinqiang Li, Xiangrong Zhu, Huajian Zhang, Yue Zhang

Code

github.com/gongyifan-hash/cover-benchmark
Official

Abstract

Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce COVER (COunterfactual VidEo Reasoning), a multidimensional multimodal benchmark that systematically evaluates multimodal large language models (MLLMs) across the abstract-concrete and perception-cognition dimensions. Unlike prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments.
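
The abstract's comparison of sub-question accuracy with counterfactual performance can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `ItemResult` record layout, the function names, and the choice of Pearson correlation are hypothetical and are not taken from the COVER codebase or paper.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ItemResult:
    """Outcome for one benchmark item (hypothetical record layout)."""
    sub_correct: list[bool]        # one flag per sub-question
    counterfactual_correct: bool   # flag for the main counterfactual question


def sub_question_accuracy(results: list[ItemResult]) -> float:
    """Fraction of all sub-questions answered correctly, pooled over items."""
    flags = [flag for r in results for flag in r.sub_correct]
    return sum(flags) / len(flags)


def counterfactual_accuracy(results: list[ItemResult]) -> float:
    """Fraction of items whose counterfactual question was answered correctly."""
    return sum(r.counterfactual_correct for r in results) / len(results)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

Under these assumptions, one would compute both accuracies per model and pass the two per-model lists to `pearson`. The benchmark's actual scoring and correlation procedure may differ.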

Tasks

All, counterfactual, Counterfactual Reasoning, Logical Reasoning, Video Understanding
