SOTAVerified

Moment Retrieval

Moment retrieval can be defined as the task of "localizing moments in a video given a user query".

Description from: QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries

Image credit: QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries

Papers

Showing 51-75 of 132 papers

| Title | Status | Hype |
|---|---|---|
| Video Moment Retrieval from Text Queries via Single Frame Annotation | Code | 1 |
| Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection | Code | 1 |
| Deconfounded Video Moment Retrieval with Causal Intervention | Code | 1 |
| VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval | Code | 1 |
| Detecting Moments and Highlights in Videos via Natural Language Queries | Code | 1 |
| Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval | Code | 0 |
| Anchor-Aware Similarity Cohesion in Target Frames Enables Predicting Temporal Moment Boundaries in 2D | Code | 0 |
| Boundary-Denoising for Video Activity Localization | Code | 0 |
| Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos | Code | 0 |
| DTOS: Dynamic Time Object Sensing with Large Multimodal Model | Code | 0 |
| Exploring Temporal Concurrency for Video-Language Representation Learning | Code | 0 |
| Going for GOAL: A Resource for Grounded Football Commentaries | Code | 0 |
| Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement | Code | 0 |
| Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains | Code | 0 |
| LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval | Code | 0 |
| Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval | Code | 0 |
| Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval | Code | 0 |
| MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors | Code | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Code | 0 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Code | 0 |
| SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding | Code | 0 |
| Show and Guide: Instructional-Plan Grounded Vision and Language Model | Code | 0 |
| SimVTP: Simple Video Text Pre-training with Masked Autoencoders | Code | 0 |
| Towards Diverse Temporal Grounding under Single Positive Labels | Code | 0 |
| TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries | Code | 0 |

Benchmark Results

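| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | UnLoc-L | R@1 IoU=0.5 | 66.1 | | Unverified |
| 2 | UnLoc-B | R@1 IoU=0.5 | 64.5 | | Unverified |
| 3 | DenoiseLoc | R@1 IoU=0.5 | 59.27 | | Unverified |
| 4 | SG-DETR (w/ PT) | mAP | 58.8 | | Unverified |
| 5 | SG-DETR | mAP | 54.1 | | Unverified |
| 6 | LLaVA-MR | mAP | 52.73 | | Unverified |
| 7 | FlashVTG | mAP | 52 | | Unverified |
| 8 | InternVideo2-6B | mAP | 49.24 | | Unverified |
| 9 | CG-DETR (w/ PT) | mAP | 47.97 | | Unverified |
| 10 | VideoLights-B-pt | mAP | 47.94 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SG-DETR (w/ PT) | R@1 IoU=0.5 | 71.1 | | Unverified |
| 2 | LLaVA-MR | R@1 IoU=0.5 | 70.65 | | Unverified |
| 3 | FlashVTG | R@1 IoU=0.5 | 70.32 | | Unverified |
| 4 | SG-DETR | R@1 IoU=0.5 | 70.2 | | Unverified |
| 5 | InternVideo2-6B | R@1 IoU=0.5 | 70.03 | | Unverified |
| 6 | InternVideo2-1B | R@1 IoU=0.5 | 68.36 | | Unverified |
| 7 | VideoChat-T (FT) | R@1 IoU=0.5 | 67.1 | | Unverified |
| 8 | UniMD+Sync. | R@1 IoU=0.5 | 63.98 | | Unverified |
| 9 | LD-DETR | R@1 IoU=0.5 | 62.58 | | Unverified |
| 10 | VideoLights-B-pt | R@1 IoU=0.5 | 61.96 | | Unverified |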
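
For readers unfamiliar with the "R@1 IoU=0.5" metric reported above, the following is a minimal sketch of how it is commonly computed: a query counts as a hit when its top-ranked predicted moment overlaps the ground-truth moment with a temporal intersection-over-union of at least 0.5. The function names, the (start, end) span representation in seconds, the single ground-truth moment per query, and the inclusive threshold are all assumptions made for illustration; individual benchmarks may handle multiple ground-truth moments or the threshold comparison differently, and mAP is a separate metric not shown here.

```python
from typing import List, Tuple

# Hypothetical representation: a moment is a (start_sec, end_sec) pair.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap length, clamped at zero for disjoint intervals.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span],
                iou_threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes exactly one ground-truth moment per query (an illustrative
    simplification).
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need one prediction per ground-truth moment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Two queries: the first prediction overlaps well, the second does not.
preds = [(2.0, 8.0), (0.0, 10.0)]
truth = [(0.0, 10.0), (5.0, 15.0)]
score = recall_at_1(preds, truth)
```

In this example the first pair has an IoU of 0.6 and the second an IoU of one third, so one of the two queries counts as a hit.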