Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models
Yuxiang Lin, Jingdong Sun, Zhi-Qi Cheng, Jue Wang, Haomin Liang, Zebang Cheng, Yifei Dong, Jun-Yan He, Xiaojiang Peng, Xian-Sheng Hua
Code
- github.com/lum1104/eibench (official, in paper; PyTorch) ★ 0
- github.com/Lum1104/MER-Factory ★ 85
Abstract
Most existing emotion analysis emphasizes which emotion arises (e.g., happy, sad, angry) but neglects the deeper why. We propose Emotion Interpretation (EI), which focuses on the causal factors, whether explicit (e.g., observable objects, interpersonal interactions) or implicit (e.g., cultural context, off-screen events), that drive emotional responses. Unlike traditional emotion recognition, EI requires reasoning about triggers rather than mere labeling. To facilitate EI research, we present EIBench, a large-scale benchmark comprising 1,615 basic EI samples and 50 complex EI samples featuring multifaceted emotions. Each instance demands a rationale-based explanation rather than a straightforward categorization. We further propose a Coarse-to-Fine Self-Ask (CFSA) annotation pipeline, which guides Vision-Language Models (VLLMs) through iterative question-answer rounds to yield high-quality labels at scale. Extensive evaluations of open-source and proprietary large language models under four experimental settings reveal consistent performance gaps, especially in more intricate scenarios, underscoring EI's potential to enrich empathetic, context-aware AI applications. Our benchmark and methods are publicly available at: https://github.com/Lum1104/EIBench, offering a foundation for advanced multimodal causal analysis and next-generation affective computing.
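The abstract describes the CFSA pipeline only at a high level; the repositories above hold the actual implementation. For intuition, here is a minimal, hypothetical sketch of what an iterative coarse-to-fine self-ask annotation loop could look like. The `query_vllm` helper, the prompts, the round count, and the data layout are all illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a CFSA-style annotation loop. `query_vllm` is a
# placeholder for any vision-language chat API; prompts and round structure
# are illustrative assumptions, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class EISample:
    image_path: str
    emotion_label: str
    qa_rounds: list = field(default_factory=list)
    rationale: str = ""

def query_vllm(image_path: str, prompt: str) -> str:
    """Stand-in for a call to a vision-language model (API client or local model)."""
    raise NotImplementedError("wire up your VLLM client here")

def cfsa_annotate(sample: EISample, n_rounds: int = 3) -> EISample:
    # Coarse pass: an overall description of the emotional scene.
    context = query_vllm(
        sample.image_path,
        f"Describe this scene and why it might evoke '{sample.emotion_label}'.",
    )
    # Fine passes: the model poses and answers its own follow-up questions,
    # drilling into explicit and implicit causal factors.
    for _ in range(n_rounds):
        question = query_vllm(
            sample.image_path,
            f"Context: {context}\nAsk one follow-up question about a cause "
            "of the emotion that is not yet explained.",
        )
        answer = query_vllm(sample.image_path, f"Context: {context}\n{question}")
        sample.qa_rounds.append((question, answer))
        context += f"\nQ: {question}\nA: {answer}"
    # Final pass: distill the accumulated Q&A into a rationale annotation.
    sample.rationale = query_vllm(
        sample.image_path,
        f"Context: {context}\nSummarize, in two or three sentences, the causal "
        f"factors behind '{sample.emotion_label}'.",
    )
    return sample
```

The coarse pass anchors the model in the scene before any causal probing; the self-ask rounds then accumulate context so each question can target a still-unexplained trigger, which is the coarse-to-fine idea the paper's name suggests.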
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified (%) | Status |
|---|---|---|---|---|---|
| EIBench | Claude-3-haiku | Recall | 63.24 | — | Unverified |
| EIBench | LLaVA-1.5 (13B) | Recall | 54.37 | — | Unverified |
| EIBench | LLaVA-NeXT (13B) | Recall | 54.33 | — | Unverified |
| EIBench | Claude-3-sonnet | Recall | 54.1 | — | Unverified |
| EIBench | LLaVA-NeXT (7B) | Recall | 53.82 | — | Unverified |
| EIBench | MiniGPT-v2 | Recall | 52.89 | — | Unverified |
| EIBench | ChatGPT-4o | Recall | 49.99 | — | Unverified |
| EIBench | Video-LLaVA | Recall | 49.26 | — | Unverified |
| EIBench | LLaVA-NeXT (34B) | Recall | 49.03 | — | Unverified |
| EIBench | ChatGPT-4V | Recall | 46.86 | — | Unverified |
| EIBench | Otter | Recall | 42.81 | — | Unverified |
| EIBench | Qwen-VL-Plus | Recall | 31 | — | Unverified |
| EIBench | Qwen-VL-Chat | Recall | 26.45 | — | Unverified |
| EIBench (complex) | ChatGPT-4o | Recall | 39.27 | — | Unverified |
| EIBench (complex) | LLaVA-NeXT (13B) | Recall | 39.16 | — | Unverified |
| EIBench (complex) | LLaVA-NeXT (7B) | Recall | 38.71 | — | Unverified |
| EIBench (complex) | LLaVA-1.5 (13B) | Recall | 38.1 | — | Unverified |
| EIBench (complex) | LLaVA-NeXT (34B) | Recall | 35.37 | — | Unverified |
| EIBench (complex) | MiniGPT-v2 | Recall | 35.1 | — | Unverified |
| EIBench (complex) | Video-LLaVA | Recall | 30.9 | — | Unverified |
| EIBench (complex) | ChatGPT-4V | Recall | 28 | — | Unverified |
| EIBench (complex) | Otter | Recall | 27.9 | — | Unverified |
| EIBench (complex) | Claude-3-haiku | Recall | 24 | — | Unverified |
| EIBench (complex) | Qwen-VL-Chat | Recall | 22 | — | Unverified |
| EIBench (complex) | Claude-3-sonnet | Recall | 21.37 | — | Unverified |
| EIBench (complex) | Qwen-VL-Plus | Recall | 20.37 | — | Unverified |
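The table reports recall over annotated causal factors, but this page does not spell out the matching procedure. As a rough illustration only, the sketch below computes a factor-level recall with naive substring matching standing in for whatever semantic judge the benchmark actually uses; `factor_recall` and `benchmark_recall` are invented names, not the benchmark's evaluation code.

```python
# Hypothetical sketch of factor-level recall scoring. Substring matching is a
# stand-in for the benchmark's actual (likely semantic) matching procedure.
def factor_recall(prediction: str, gold_factors: list[str]) -> float:
    """Fraction of annotated causal factors recovered in the model's explanation."""
    pred = prediction.lower()
    hits = sum(1 for factor in gold_factors if factor.lower() in pred)
    return hits / len(gold_factors) if gold_factors else 0.0

def benchmark_recall(results: list[tuple[str, list[str]]]) -> float:
    """Mean per-sample recall, expressed as a percentage as in the table above."""
    scores = [factor_recall(pred, gold) for pred, gold in results]
    return 100.0 * sum(scores) / len(scores)

# Example: one sample whose explanation recovers 1 of its 2 annotated factors.
print(benchmark_recall([
    ("She smiles because her friend returned after a long trip.",
     ["friend returned", "surprise party"]),
]))  # -> 50.0
```

Under any such scoring, recall rewards explanations that cover more of the annotated triggers, which is why the complex split, with its multifaceted emotions and larger factor sets, yields uniformly lower scores than the basic split.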