MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

2025-10-10

Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

Code

github.com/dmmm1997/momentseg
Official implementation (linked in paper)

Abstract

Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlook essential temporal cues, while the latter increase system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which uses the most relevant moment as the starting point for high-quality mask initialization and dynamically updates the anchor at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
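
To make the sampling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each frame has one temporal token, that relevance is the cosine similarity between a [FIND]-style query vector and each frame token, and that the key moment is simply the set of frames scoring above a fixed threshold. The function names, the threshold, and both stride values are hypothetical choices made for illustration only.

```python
import math


def cosine_scores(query, frame_tokens):
    """Cosine similarity between one query vector and each per-frame vector.

    Assumes all vectors are non-zero and share the same length.
    """
    query_norm = math.sqrt(sum(x * x for x in query))
    scores = []
    for token in frame_tokens:
        token_norm = math.sqrt(sum(x * x for x in token))
        dot = sum(a * b for a, b in zip(query, token))
        scores.append(dot / (query_norm * token_norm))
    return scores


def moment_centric_sample(scores, threshold=0.5, dense_stride=1, sparse_stride=8):
    """Pick frame indices densely inside the key moment and sparsely elsewhere.

    Returns (sorted frame indices, index of the highest-scoring frame).
    The threshold and strides are illustrative assumptions, not paper values.
    """
    n = len(scores)
    # Sparse pass over the whole video keeps global context.
    picked = set(range(0, n, sparse_stride))
    # Dense pass over frames judged relevant keeps motion detail.
    moment = [i for i, s in enumerate(scores) if s >= threshold]
    picked.update(moment[::dense_stride])
    # The most relevant frame is always kept; a propagation scheme such as
    # BAP could use it as the starting point for mask initialization.
    anchor = max(range(n), key=lambda i: scores[i])
    picked.add(anchor)
    return sorted(picked), anchor


# Toy example: 32 frames, with frames 10 to 13 forming the key moment.
scores = [0.1] * 32
scores[10:14] = [0.8, 0.9, 1.0, 0.7]
frames, anchor = moment_centric_sample(scores)
print(frames, anchor)  # [0, 8, 10, 11, 12, 13, 16, 24] 12
```

In this toy run the four high-scoring frames are all kept, while the rest of the video contributes only every eighth frame. A real system would presumably derive the moment boundaries from the learned similarity scores rather than from a fixed threshold.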