Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

2026-03-05

Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle

Abstract

Foundation models are used for many real-world applications involving language generation from temporally ordered multimodal events. In this work, we study the ability of models to identify the most important sub-events in a video, which is a fundamental prerequisite for narrating or summarizing multimodal events. Specifically, we focus on football games and evaluate models on their ability to distinguish between important and non-important sub-events in a game. To this end, we construct a new dataset by leveraging human preferences for importance implicit in football game highlight reels, without any additional annotation costs. Using our dataset, we compare several state-of-the-art multimodal models and show that they are not far from chance-level performance. Analyses of models beyond standard evaluation metrics reveal their tendency to rely on a single dominant modality and their ineffectiveness in synthesizing necessary information from multiple sources. Our findings underline the importance of modular architectures that can handle sample-level heterogeneity in multimodal data and the need for complementary training procedures that can maximize cross-modal synergy.
