
EgoNormia: Benchmarking Physical Social Norm Understanding

2025-02-27

MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang


Abstract

Human activity is moderated by norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, particularly when norms are physically or socially grounded. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia, a benchmark of 1,853 challenging, multi-stage multiple-choice questions based on ego-centric videos of human interactions, evaluating both the prediction and justification of normative actions. The normative actions span seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline that combines video sampling, automatic answer generation, filtering, and human validation. Our results show that current state-of-the-art VLMs lack robust norm understanding, scoring at most 54% on EgoNormia (versus a human benchmark of 92%). Our analysis of performance in each category highlights significant risks to safety and privacy, as well as a lack of collaboration and communication capability, when these models are deployed in real-world agents. We additionally show that, through a retrieval-augmented generation (RAG) method, EgoNormia can be used to enhance normative reasoning in VLMs.
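To make the RAG idea concrete, the sketch below shows one minimal way retrieval could be wired into a VLM prompt: retrieve normatively similar situations from a small example bank and prepend them to the multiple-choice question. This is an illustration under assumptions, not the paper's released pipeline; NORM_BANK, retrieve_examples, build_prompt, and the query_vlm placeholder are hypothetical names, and the example uses simple TF-IDF text similarity rather than whatever retrieval representation the authors actually use.

```python
# Minimal sketch of retrieval-augmented normative reasoning (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical in-memory "norm bank": situation description -> normative action.
NORM_BANK = [
    {"context": "Hiker offers to help a stranger cross a narrow log bridge",
     "norm": "cooperation: offer assistance before being asked"},
    {"context": "Guest enters a home where others have removed their shoes",
     "norm": "politeness: remove shoes at the entrance"},
    {"context": "Cyclist approaches pedestrians from behind on a shared path",
     "norm": "communication: announce yourself before passing"},
]

def retrieve_examples(query: str, k: int = 2):
    """Return the k norm-bank entries most similar to the query description."""
    corpus = [e["context"] for e in NORM_BANK]
    vec = TfidfVectorizer().fit(corpus + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
    return [NORM_BANK[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(video_description: str, choices: list[str]) -> str:
    """Compose a multiple-choice prompt that prepends retrieved norm examples."""
    shots = "\n".join(f"- {s['context']} -> {s['norm']}"
                      for s in retrieve_examples(video_description))
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "Relevant norms from similar situations:\n"
        f"{shots}\n\n"
        f"Current situation: {video_description}\n"
        f"Which action is most appropriate, and why?\n{options}"
    )

# query_vlm(prompt, video_frames) would be a placeholder for whichever VLM API
# is being evaluated; it is not defined here.
```

The same prompt-construction step would be repeated for each stage of the multi-stage question (action prediction, then justification), with the retrieved examples kept fixed across stages.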
