SOTAVerified

Spatio-Temporal Video Grounding

Spatio-temporal video grounding (STVG) is a task at the intersection of computer vision and natural language processing (NLP): given a video and a textual query, localize what the query describes in both space and time. The output is typically a temporal segment together with a bounding box in each frame of that segment, often called a spatio-temporal tube. The task supports applications such as video summarization, content-based video retrieval, and video captioning.

Papers

Showing 1-10 of 22 papers

| Title | Status | Hype |
| --- | --- | --- |
| SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | | 0 |
| Large-scale Pre-training for Grounded Video Caption Generation | Code | 1 |
| Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Code | 1 |
| Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding | | 0 |
| STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding | | 0 |
| Context-Guided Spatio-Temporal Video Grounding | Code | 2 |
| VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | | 0 |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | | 0 |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2 |
| Guided Attention for Interpretable Motion Captioning | Code | 0 |

Benchmark Results

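| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | TA-STVG | Val m_vIoU | 40.2 | | Unverified |
| 2 | CG-STVG | Val m_vIoU | 39.5 | | Unverified |
| 3 | STVGFormer | Val m_vIoU | 38.7 | | Unverified |
| 4 | TubeDETR | Val m_vIoU | 36.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | TA-STVG | m_vIoU | 39.1 | | Unverified |
| 2 | CG-STVG | m_vIoU | 38.4 | | Unverified |
| 3 | TubeDETR | m_vIoU | 32.4 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | TA-STVG | Declarative m_vIoU | 34.4 | | Unverified |
| 2 | CG-STVG | Declarative m_vIoU | 34 | | Unverified |
| 3 | TubeDETR | Declarative m_vIoU | 30.4 | | Unverified |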
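
The m_vIoU metric in the results above is not defined on this page. As a rough illustration, the sketch below follows the definition commonly used in the STVG literature: for one video-query pair, vIoU sums the per-frame box IoU over the frames shared by the predicted and ground-truth tubes and divides by the number of frames in their union; m_vIoU is the mean over all pairs. The names `Box`, `Tube`, `box_iou`, `viou`, and `mean_viou` are hypothetical, and the tube layout (frame index mapped to an `(x1, y1, x2, y2)` box) is an assumption made for this example, not the evaluation code of any listed model.

```python
from typing import Dict, Iterable, Tuple

# Assumed representation for this sketch (not taken from any listed paper):
# a box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """vIoU for one sample: per-frame IoU summed over shared frames,
    divided by the number of frames in the temporal union."""
    shared_frames = pred.keys() & gt.keys()
    all_frames = pred.keys() | gt.keys()
    if not all_frames:
        return 0.0
    return sum(box_iou(pred[t], gt[t]) for t in shared_frames) / len(all_frames)


def mean_viou(pairs: Iterable[Tuple[Tube, Tube]]) -> float:
    """m_vIoU: average vIoU over (prediction, ground truth) pairs."""
    scores = [viou(p, g) for p, g in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: ground truth spans frames 0-3, prediction spans frames 2-4.
gt_tube: Tube = {t: (0.0, 0.0, 10.0, 10.0) for t in range(4)}
pred_tube: Tube = {
    2: (0.0, 0.0, 10.0, 10.0),  # perfect box, IoU 1.0
    3: (0.0, 0.0, 5.0, 10.0),   # half-width box, IoU 0.5
    4: (0.0, 0.0, 10.0, 10.0),  # outside the ground-truth segment
}
print(viou(pred_tube, gt_tube))
```

In the toy example, two frames are shared and five frames lie in the union, so the score is (1.0 + 0.5) / 5 = 0.3. Because the denominator counts the temporal union, a prediction is penalized both for missing ground-truth frames and for extending past them. Leaderboard values such as those above appear to be reported as percentages.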