Learning to Generate Long-term Future Narrations Describing Activities of Daily Living

2025-03-03

Ramanathan Rajendiran, Debaditya Roy, Basura Fernando

Abstract

Anticipating future events is crucial for application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information that enhances a system's future planning and decision-making capabilities. We propose a novel task, long-term future narration generation, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and their corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. ViNa extends existing multimodal models, which perform only short-term predictions or describe observed videos, by generating long-term future narrations for a broader range of daily activities. We also present a novel downstream application, future video retrieval, which leverages the generated narrations to help users plan a task by visualizing the future. We evaluate future narration generation on Ego4D, the largest egocentric dataset.

Tasks

Action Anticipation, Decision Making, Language Modeling, Video Retrieval
