SOTAVerified

Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding

2026-03-07

Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Ao Ma, Run Ling, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li


Abstract

Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present Video-EM, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to orchestrate off-the-shelf tools: it first localizes query-relevant moments via multi-grained semantic matching, then groups and segments them into temporally coherent events, and finally encodes each event as a grounded episodic memory with explicit temporal indices and spatio-temporal cues (capturing when, where, what, and involved entities). To further suppress verbosity and noise from imperfect upstream signals, Video-EM integrates a reasoning-driven self-reflection loop that iteratively verifies evidence sufficiency and cross-event consistency, removes redundancy, and adaptively adjusts event granularity. The outcome is a compact yet reliable event timeline -- a minimal but sufficient episodic memory set that can be directly consumed by existing Video-LLMs without additional training or architectural changes.
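The pipeline described above can be sketched in miniature. This is a toy illustration, not the paper's implementation: all function names are hypothetical, and scalar relevance scores stand in for the multi-grained semantic matching and LLM-driven self-reflection that Video-EM actually uses.

```python
# Toy sketch of the Video-EM event-centric pipeline (hypothetical names).
# The real system orchestrates an LLM agent with off-the-shelf tools;
# here, precomputed per-frame relevance scores replace semantic matching.

def localize_moments(frame_scores, threshold=0.5):
    """Step 1: keep frame indices whose query relevance clears a threshold."""
    return [t for t, s in enumerate(frame_scores) if s >= threshold]

def group_into_events(moments, max_gap=2):
    """Step 2: merge temporally close moments into coherent events."""
    events = []
    for t in moments:
        if events and t - events[-1][-1] <= max_gap:
            events[-1].append(t)
        else:
            events.append([t])
    return events

def encode_episodic_memory(events):
    """Step 3: attach explicit temporal indices ("when") to each event.
    A real system would also add where/what/entity cues via captioning."""
    return [{"when": (ev[0], ev[-1]), "frames": ev} for ev in events]

def refine_memory(memories, min_len=2):
    """Step 4: self-reflection stand-in: drop events too brief to be
    reliable evidence (the paper instead verifies sufficiency with an LLM)."""
    return [m for m in memories if len(m["frames"]) >= min_len]

# Example: 8 frames scored against a query.
scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.1]
moments = localize_moments(scores)           # -> [1, 2, 5, 6]
events = group_into_events(moments)          # -> [[1, 2], [5, 6]]
timeline = refine_memory(encode_episodic_memory(events))
print(timeline)
```

The resulting `timeline` is the compact event list that would be handed to a frozen Video-LLM as context; the real refinement loop additionally checks cross-event consistency and adjusts event granularity, which this sketch omits.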
