Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language

2020-08-01 · ECCV 2020 · Code Available

Shaoxiang Chen, Yu-Gang Jiang

Abstract

Temporal Activity Localization via Language (TALL) in video is a recently proposed, challenging vision task. Tackling it requires fine-grained understanding of the video content, yet most existing works overlook this. In this paper, we propose a novel TALL method that builds a Hierarchical Visual-Textual Graph to model interactions between objects and words, as well as among the objects, in order to jointly understand the video content and the language. We also design a convolutional network with a cross-channel communication mechanism to further encourage information passing between the visual and textual modalities. Finally, we propose a loss function that enforces alignment between the visual representation of the localized activity and the sentence representation, so that the model can predict more accurate temporal boundaries. We evaluated our method on two popular benchmark datasets, Charades-STA and ActivityNet Captions, and achieved state-of-the-art performance on both. Code is available at https://github.com/forwchen/HVTG.
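
To make the idea of an alignment loss concrete, here is a minimal sketch of one way such a loss could look. It is not the formulation from the paper or the HVTG repository: mean pooling over the localized frames, cosine similarity as the alignment measure, and all function names (`mean_pool`, `cosine`, `alignment_loss`) are assumptions chosen for illustration only.

```python
import math

def mean_pool(frames, start, end):
    """Average per-frame feature vectors over the inclusive range [start, end].

    Assumption: the localized activity is summarized by simple mean pooling.
    """
    if start > end or start < 0 or end >= len(frames):
        raise ValueError("invalid segment boundaries")
    segment = frames[start:end + 1]
    dim = len(segment[0])
    return [sum(f[d] for f in segment) / len(segment) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def alignment_loss(frames, start, end, sentence):
    """Hypothetical loss: 1 - cosine(pooled segment features, sentence vector).

    The loss is low when the localized segment's visual representation
    points in the same direction as the sentence representation.
    """
    return 1.0 - cosine(mean_pool(frames, start, end), sentence)

# Toy data: four frames with 2-d features, and a sentence embedding
# that matches the content of frames 1 and 2.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
sentence = [0.0, 1.0]

good = alignment_loss(frames, 1, 2, sentence)  # segment matching the sentence
bad = alignment_loss(frames, 0, 0, sentence)   # segment unrelated to the sentence
```

In this toy setup the well-localized segment receives a lower loss than the mismatched one, which is the kind of signal that could push predicted temporal boundaries toward the described activity.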
