SOTAVerified

Moment Retrieval

Moment retrieval can be defined as the task of "localizing moments in a video given a user query".

Description from: QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries

Image credit: QVHIGHLIGHTS: Detecting Moments and Highlights in Videos via Natural Language Queries

Papers

Showing 1–50 of 132 papers

| Title | Status | Hype |
| --- | --- | --- |
| DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding | | 0 |
| Retrieval Augmented Generation Evaluation for Health Documents | | 0 |
| Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection | | 0 |
| Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking | | 0 |
| TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos | Code | 1 |
| MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval | | 0 |
| Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval | Code | 0 |
| LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection | Code | 1 |
| Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection | | 0 |
| Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models | | 0 |
| The Devil is in the Spurious Correlation: Boosting Moment Retrieval via Temporal Dynamic Learning | | 0 |
| A Flexible and Scalable Framework for Video Moment Search | Code | 1 |
| Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection | Code | 1 |
| DTOS: Dynamic Time Object Sensing with Large Multimodal Model | Code | 0 |
| Anchor-Aware Similarity Cohesion in Target Frames Enables Predicting Temporal Moment Boundaries in 2D | Code | 0 |
| Length-Aware DETR for Robust Moment Retrieval | Code | 1 |
| DAVE: Diverse Atomic Visual Elements Dataset with High Representation of Vulnerable Road Users in Complex and Unpredictable Environments | | 0 |
| FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding | Code | 1 |
| Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning | | 0 |
| Agent-based Video Trimming | | 0 |
| VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval | Code | 1 |
| Vid-Morp: Video Moment Retrieval Pretraining from Unlabeled Videos in the Wild | Code | 1 |
| LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval | Code | 0 |
| Number it: Temporal Grounding Videos like Flipping Manga | Code | 2 |
| TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Code | 2 |
| VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding | Code | 1 |
| Saliency-Guided DETR for Moment Retrieval and Highlight Detection | Code | 1 |
| Show and Guide: Instructional-Plan Grounded Vision and Language Model | Code | 0 |
| EAGLE: Egocentric AGgregated Language-video Engine | | 0 |
| Language-based Audio Moment Retrieval | Code | 3 |
| D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching | | 0 |
| QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval | | 0 |
| Disentangle and denoise: Tackling context misalignment for video moment retrieval | | 0 |
| Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection | Code | 3 |
| SLVideo: A Sign Language Video Moment Retrieval Framework | | 0 |
| Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval | Code | 2 |
| Multi-sentence Video Grounding for Long Video Generation | | 0 |
| EA-VTR: Event-Aware Video-Text Retrieval | | 0 |
| TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries | Code | 0 |
| SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding | Code | 0 |
| The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval | Code | 2 |
| MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval | | 0 |
| 2DP-2MRC: 2-Dimensional Pointer-based Machine Reading Comprehension Method for Multimodal Moment Retrieval | | 0 |
| Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels | | 0 |
| VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2 |
| Context-Enhanced Video Moment Retrieval with Large Language Models | | 0 |
| MLP: Motion Label Prior for Temporal Sentence Localization in Untrimmed 3D Human Motions | Code | 1 |
| Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection | Code | 1 |
| UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection | Code | 2 |
| R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding | Code | 0 |

Benchmark Results

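| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | UnLoc-L | R@1 IoU=0.5 | 66.1 | | Unverified |
| 2 | UnLoc-B | R@1 IoU=0.5 | 64.5 | | Unverified |
| 3 | DenoiseLoc | R@1 IoU=0.5 | 59.27 | | Unverified |
| 4 | SG-DETR (w/ PT) | mAP | 58.8 | | Unverified |
| 5 | SG-DETR | mAP | 54.1 | | Unverified |
| 6 | LLaVA-MR | mAP | 52.73 | | Unverified |
| 7 | FlashVTG | mAP | 52 | | Unverified |
| 8 | InternVideo2-6B | mAP | 49.24 | | Unverified |
| 9 | CG-DETR (w/ PT) | mAP | 47.97 | | Unverified |
| 10 | VideoLights-B-pt | mAP | 47.94 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SG-DETR (w/ PT) | R@1 IoU=0.5 | 71.1 | | Unverified |
| 2 | LLaVA-MR | R@1 IoU=0.5 | 70.65 | | Unverified |
| 3 | FlashVTG | R@1 IoU=0.5 | 70.32 | | Unverified |
| 4 | SG-DETR | R@1 IoU=0.5 | 70.2 | | Unverified |
| 5 | InternVideo2-6B | R@1 IoU=0.5 | 70.03 | | Unverified |
| 6 | InternVideo2-1B | R@1 IoU=0.5 | 68.36 | | Unverified |
| 7 | VideoChat-T (FT) | R@1 IoU=0.5 | 67.1 | | Unverified |
| 8 | UniMD+Sync. | R@1 IoU=0.5 | 63.98 | | Unverified |
| 9 | LD-DETR | R@1 IoU=0.5 | 62.58 | | Unverified |
| 10 | VideoLights-B-pt | R@1 IoU=0.5 | 61.96 | | Unverified |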
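
For readers unfamiliar with the R@1 IoU=0.5 metric reported above, the following is a minimal sketch of how it is commonly computed: a query counts as a hit when the top-ranked predicted moment overlaps the ground-truth moment with a temporal intersection-over-union of at least 0.5. The function names, the `(start, end)` tuple representation, the single ground-truth moment per query, and the `>=` comparison are all assumptions made for illustration; the exact evaluation protocol of each benchmark listed here may differ.

```python
from typing import List, Tuple

# A moment is a (start_sec, end_sec) pair; this representation is an assumption.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Moment], gts: List[Moment],
                iou_threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes exactly one ground-truth moment per query (hypothetical setup).
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need one top-1 prediction per ground-truth moment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0)]
    gts = [(0.0, 8.0), (20.0, 30.0)]
    # First query overlaps with IoU 0.8 (hit); second does not overlap (miss).
    print(recall_at_1(preds, gts))
```

Reporting the score as a percentage matches the scale of the claimed values in the tables, but that scaling is also an assumption of this sketch.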