Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
Yuxiang Lin, Jingdong Sun, Zhi-Qi Cheng, Jue Wang, Haomin Liang, Zebang Cheng, Yifei Dong, Jun-Yan He, Xiaojiang Peng, Xian-Sheng Hua
Code
- github.com/lum1104/eibench (official, in paper; PyTorch) ★ 0
- github.com/Lum1104/MER-Factory ★ 85
Abstract
Most existing emotion analysis emphasizes which emotion arises (e.g., happy, sad, angry) but neglects the deeper why. We propose Emotion Interpretation (EI), which focuses on the causal factors, whether explicit (e.g., observable objects, interpersonal interactions) or implicit (e.g., cultural context, off-screen events), that drive emotional responses. Unlike traditional emotion recognition, EI requires reasoning about triggers rather than mere labeling. To facilitate EI research, we present EIBench, a large-scale benchmark comprising 1,615 basic EI samples and 50 complex EI samples featuring multifaceted emotions. Each instance demands a rationale-based explanation rather than a straightforward categorization. We further propose a Coarse-to-Fine Self-Ask (CFSA) annotation pipeline, which guides Vision-Language Models (VLLMs) through iterative question-answer rounds to yield high-quality labels at scale. Extensive evaluations of open-source and proprietary large language models under four experimental settings reveal consistent performance gaps, especially in more intricate scenarios, underscoring EI's potential to enrich empathetic, context-aware AI applications. Our benchmark and methods are publicly available at: https://github.com/Lum1104/EIBench, offering a foundation for advanced multimodal causal analysis and next-generation affective computing.
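The abstract describes the CFSA pipeline only at a high level; the repositories above hold the actual implementation. For intuition, here is a minimal, hypothetical sketch of what an iterative coarse-to-fine self-ask annotation loop could look like. The `query_vllm` helper, the prompts, the round count, and the data layout are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a CFSA-style annotation loop. `query_vllm` is a
# placeholder for any vision-language chat API; prompts and round structure
# are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class EISample:
    image_path: str
    emotion_label: str
    qa_rounds: list = field(default_factory=list)
    rationale: str = ""

def query_vllm(image_path: str, prompt: str) -> str:
    """Stand-in for a call to a vision-language model (API client or local model)."""
    raise NotImplementedError("wire up your VLLM client here")

def cfsa_annotate(sample: EISample, n_rounds: int = 3) -> EISample:
    # Coarse pass: an overall description of the emotional scene.
    context = query_vllm(
        sample.image_path,
        f"Describe this scene and why it might evoke '{sample.emotion_label}'.",
    )
    # Fine passes: the model poses and answers its own follow-up questions,
    # drilling into explicit and implicit causal factors.
    for _ in range(n_rounds):
        question = query_vllm(
            sample.image_path,
            f"Context: {context}\nAsk one follow-up question about a cause "
            "of the emotion that is not yet explained.",
        )
        answer = query_vllm(sample.image_path, f"Context: {context}\n{question}")
        sample.qa_rounds.append((question, answer))
        context += f"\nQ: {question}\nA: {answer}"
    # Final pass: distill the accumulated Q&A into a rationale annotation.
    sample.rationale = query_vllm(
        sample.image_path,
        f"Context: {context}\nSummarize, in two or three sentences, the causal "
        f"factors behind '{sample.emotion_label}'.",
    )
    return sample
```

The coarse pass anchors the model in the scene before any causal probing; the self-ask rounds then accumulate context so each question can target a still-unexplained trigger, which is the coarse-to-fine idea the paper's name suggests.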
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified (%) | Status |
|---|---|---|---|---|---|
| EIBench | Claude-3-haiku | Recall | 63.24 | — | Unverified |
| EIBench | LLaVA-1.5 (13B) | Recall | 54.37 | — | Unverified |
| EIBench | LLaVA-NeXT (13B) | Recall | 54.33 | — | Unverified |
| EIBench | Claude-3-sonnet | Recall | 54.1 | — | Unverified |
| EIBench | LLaVA-NeXT (7B) | Recall | 53.82 | — | Unverified |
| EIBench | MiniGPT-v2 | Recall | 52.89 | — | Unverified |
| EIBench | ChatGPT-4o | Recall | 49.99 | — | Unverified |
| EIBench | Video-LLaVA | Recall | 49.26 | — | Unverified |
| EIBench | LLaVA-NeXT (34B) | Recall | 49.03 | — | Unverified |
| EIBench | ChatGPT-4V | Recall | 46.86 | — | Unverified |
| EIBench | Otter | Recall | 42.81 | — | Unverified |
| EIBench | Qwen-VL-Plus | Recall | 31 | — | Unverified |
| EIBench | Qwen-VL-Chat | Recall | 26.45 | — | Unverified |
| EIBench (complex) | ChatGPT-4o | Recall | 39.27 | — | Unverified |
| EIBench (complex) | LLaVA-NeXT (13B) | Recall | 39.16 | — | Unverified |
| EIBench (complex) | LLaVA-NeXT (7B) | Recall | 38.71 | — | Unverified |
| EIBench (complex) | LLaVA-1.5 (13B) | Recall | 38.1 | — | Unverified |
| EIBench (complex) | LLaVA-NeXT (34B) | Recall | 35.37 | — | Unverified |
| EIBench (complex) | MiniGPT-v2 | Recall | 35.1 | — | Unverified |
| EIBench (complex) | Video-LLaVA | Recall | 30.9 | — | Unverified |
| EIBench (complex) | ChatGPT-4V | Recall | 28 | — | Unverified |
| EIBench (complex) | Otter | Recall | 27.9 | — | Unverified |
| EIBench (complex) | Claude-3-haiku | Recall | 24 | — | Unverified |
| EIBench (complex) | Qwen-VL-Chat | Recall | 22 | — | Unverified |
| EIBench (complex) | Claude-3-sonnet | Recall | 21.37 | — | Unverified |
| EIBench (complex) | Qwen-VL-Plus | Recall | 20.37 | — | Unverified |
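The table reports recall over annotated causal factors, but this page does not spell out the matching procedure. As a rough illustration only, the sketch below computes a factor-level recall with naive substring matching standing in for whatever semantic judge the benchmark actually uses; `factor_recall` and `benchmark_recall` are invented names, not the benchmark's evaluation code.

```python
# Hypothetical sketch of factor-level recall scoring. Substring matching is a
# stand-in for the benchmark's actual (likely semantic) matching procedure.
def factor_recall(prediction: str, gold_factors: list[str]) -> float:
    """Fraction of annotated causal factors recovered in the model's explanation."""
    pred = prediction.lower()
    hits = sum(1 for factor in gold_factors if factor.lower() in pred)
    return hits / len(gold_factors) if gold_factors else 0.0

def benchmark_recall(results: list[tuple[str, list[str]]]) -> float:
    """Mean per-sample recall, expressed as a percentage as in the table above."""
    scores = [factor_recall(pred, gold) for pred, gold in results]
    return 100.0 * sum(scores) / len(scores)

# Example: one sample whose explanation recovers 1 of its 2 annotated factors.
print(benchmark_recall([
    ("She smiles because her friend returned after a long trip.",
     ["friend returned", "surprise party"]),
]))  # -> 50.0
```

Under any such scoring, recall rewards explanations that cover more of the annotated triggers, which is why the complex split, with its multifaceted emotions and larger factor sets, yields uniformly lower scores than the basic split.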