HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

2024-08-30

Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu

Code

github.com/joslefaure/HERMES
Official, PyTorch, ★ 38

Abstract

Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
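The abstract describes the Episodic COmpressor only at a high level, so the following is a minimal sketch of one way such episodic aggregation could work, not the authors' implementation. It assumes a fixed-capacity memory in which the two most similar adjacent frame features are repeatedly merged into their weighted mean. The names `compress_episodes`, `cosine`, and `capacity` are hypothetical and do not come from the paper or its repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress_episodes(frames, capacity):
    """Hypothetical episodic compression (an assumption, not the paper's ECO).

    Merges the most similar pair of temporally adjacent features until at
    most `capacity` entries remain. Temporal order is preserved because
    only neighbours are ever merged.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    # Each entry is (mean feature vector, number of frames it summarizes).
    memory = [(list(f), 1) for f in frames]
    while len(memory) > capacity:
        # Index of the adjacent pair with the highest cosine similarity.
        i = max(
            range(len(memory) - 1),
            key=lambda k: cosine(memory[k][0], memory[k + 1][0]),
        )
        (a, count_a), (b, count_b) = memory[i], memory[i + 1]
        total = count_a + count_b
        # Count-weighted mean, so longer episodes are not diluted by one frame.
        merged = [(x * count_a + y * count_b) / total for x, y in zip(a, b)]
        memory[i:i + 2] = [(merged, total)]
    return [vec for vec, _ in memory]


if __name__ == "__main__":
    # Four toy 2-D "frame features" forming two visually distinct segments.
    frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
    print(compress_episodes(frames, capacity=2))
```

Merging only adjacent entries is a design choice made for this sketch: it keeps each memory slot tied to a contiguous stretch of video, which matches the abstract's framing of episodes as action sequences. How the actual model selects and merges features is specified in the paper and the official repository.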

Tasks

Form Video Classification
zero-shot long video breakpoint-mode question answering
zero-shot long video global-model question answering
zero-shot long video global-mode question answering
zero-shot long video question answering

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
Breakfast	HERMES	Accuracy (%)	95.2	—	Unverified
COIN	HERMES	Accuracy (%)	93.5	—	Unverified
