Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification
Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt
Abstract
Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is an agentic concept discovery module to automatically extract object- and action-centric textual concepts from videos, yielding temporally expressive concept sets without manual supervision. Across multiple video benchmarks, this combination substantially narrows the performance gap between interpretable and black-box video models while maintaining faithful and temporally grounded concept explanations. Code available at https://github.com/patrick-knab/MoTIFgithub.com/patrick-knab/MoTIF.