
Video Narrative Grounding is the task of linking video narratives to specific video segments. The input is a video together with a text description (the narrative) in which the positions of certain nouns are marked. For each marked noun, the method must output, in every video frame, a segmentation mask for the object that noun refers to.

Source: Connecting Vision and Language with Video Localized Narratives
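
The input and output described above can be sketched as a small data structure. This is only an illustrative sketch, not the benchmark's actual data format: the class and function names (`MarkedNoun`, `VNGSample`, `empty_prediction`, `check_output_shape`) are hypothetical, and encoding noun positions as character offsets into the narrative is an assumption, since the description only says that the positions are marked.

```python
from dataclasses import dataclass
from typing import Dict, List

# One binary mask for one frame: height rows, each with width booleans.
Mask = List[List[bool]]


@dataclass
class MarkedNoun:
    # Assumption: a noun position is a character span in the narrative.
    start: int  # inclusive
    end: int    # exclusive


@dataclass
class VNGSample:
    num_frames: int
    height: int
    width: int
    narrative: str
    marked_nouns: List[MarkedNoun]

    def noun_text(self, index: int) -> str:
        """Return the text of the marked noun with the given index."""
        noun = self.marked_nouns[index]
        return self.narrative[noun.start:noun.end]


def empty_prediction(sample: VNGSample) -> Dict[int, List[Mask]]:
    """Placeholder 'method': one all-background mask per frame per noun."""
    return {
        i: [
            [[False] * sample.width for _ in range(sample.height)]
            for _ in range(sample.num_frames)
        ]
        for i in range(len(sample.marked_nouns))
    }


def check_output_shape(sample: VNGSample,
                       prediction: Dict[int, List[Mask]]) -> bool:
    """Check that every marked noun has one H x W mask in every frame."""
    if set(prediction) != set(range(len(sample.marked_nouns))):
        return False
    for masks in prediction.values():
        if len(masks) != sample.num_frames:
            return False
        for mask in masks:
            if len(mask) != sample.height:
                return False
            if any(len(row) != sample.width for row in mask):
                return False
    return True


# Toy example: a 3-frame, 4 x 6 pixel video with two marked nouns.
narrative = "A dog chases a ball across the yard."
nouns = []
for word in ("dog", "ball"):
    start = narrative.index(word)
    nouns.append(MarkedNoun(start, start + len(word)))

sample = VNGSample(num_frames=3, height=4, width=6,
                   narrative=narrative, marked_nouns=nouns)
prediction = empty_prediction(sample)
```

A real method would replace `empty_prediction` with a model that fills in each mask. The shape check only captures the output contract stated above: one mask per marked noun per frame.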

Video Narrative Grounding

Papers