SOTAVerified

Spatio-Temporal Video Grounding

Spatio-temporal video grounding (STVG) is a task at the intersection of computer vision and natural language processing (NLP): given a video and a textual query, the goal is to localize the spatio-temporal region that the query describes, typically a temporal segment together with a bounding box in each of its frames (a spatio-temporal tube). In other words, it determines both when and where in the video the described object or event appears. Applications include video summarization, content-based video retrieval, and video captioning.

Papers

Showing 11-20 of 22 papers

Title | Status | Hype
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | | 0
Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding | | 0
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding | | 0
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding | | 0
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding | | 0
Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding | | 0
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding | | 0
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | | 0
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding | | 0
STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | TA-STVG | Val m_vIoU | 40.2 | | Unverified
2 | CG-STVG | Val m_vIoU | 39.5 | | Unverified
3 | STVGFormer | Val m_vIoU | 38.7 | | Unverified
4 | TubeDETR | Val m_vIoU | 36.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | TA-STVG | m_vIoU | 39.1 | | Unverified
2 | CG-STVG | m_vIoU | 38.4 | | Unverified
3 | TubeDETR | m_vIoU | 32.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | TA-STVG | Declarative m_vIoU | 34.4 | | Unverified
2 | CG-STVG | Declarative m_vIoU | 34 | | Unverified
3 | TubeDETR | Declarative m_vIoU | 30.4 | | Unverified
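
The m_vIoU metric in the tables above is not defined on this page. As it is commonly used in the STVG literature, vIoU for one sample is the sum of per-frame box IoU over the frames shared by the predicted and ground-truth tubes, divided by the number of frames covered by either tube; m_vIoU is the mean over all samples. The following is a minimal sketch under that assumption. The names `Box`, `Tube`, `box_iou`, `viou`, and `mean_viou` are illustrative, and the box format and frame indexing are assumptions; individual benchmarks may use a different evaluation protocol, and the leaderboard values appear to be reported as percentages.

```python
from typing import Dict, List, Tuple

# Assumed representation: a box is (x1, y1, x2, y2) and a tube maps
# frame index -> box on that frame. These names are illustrative only.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """Per-frame IoU summed over shared frames, divided by the number
    of frames that appear in either tube."""
    frames_union = set(pred) | set(gt)
    if not frames_union:
        return 0.0
    frames_inter = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in frames_inter)
    return total / len(frames_union)


def mean_viou(pairs: List[Tuple[Tube, Tube]]) -> float:
    """m_vIoU: vIoU averaged over (prediction, ground truth) pairs."""
    if not pairs:
        return 0.0
    return sum(viou(p, g) for p, g in pairs) / len(pairs)


# Toy example: the ground truth spans frames 10..19, the prediction
# spans frames 15..24 with identical boxes on the 5 shared frames.
gt = {t: (10.0, 10.0, 50.0, 50.0) for t in range(10, 20)}
pred = {t: (10.0, 10.0, 50.0, 50.0) for t in range(15, 25)}

print(round(viou(pred, gt), 3))                      # 5 / 15 -> 0.333
print(round(mean_viou([(pred, gt), (gt, gt)]), 3))   # (1/3 + 1) / 2 -> 0.667
```

In the toy example the boxes match perfectly on every shared frame, yet the score is only one third, because the metric also penalizes temporal misalignment: 15 frames are covered by either tube but only 5 by both.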