SOTAVerified

Spatio-Temporal Video Grounding

Spatio-temporal video grounding is a task at the intersection of computer vision and natural language processing (NLP): given a video and a textual query or description, the goal is to localize the spatio-temporal region or moment in the video that the text refers to. In other words, a model must determine both when the described content occurs and where it appears in the frames. Applications include video summarization, content-based video retrieval, and video captioning.

Papers

Showing 1-22 of 22 papers

Title | Status | Hype
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | Code | 2
Context-Guided Spatio-Temporal Video Grounding | Code | 2
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | Code | 1
Human-centric Spatio-Temporal Video Grounding With Visual Transformers | Code | 1
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding | Code | 1
TubeDETR: Spatio-Temporal Video Grounding with Transformers | Code | 1
Large-scale Pre-training for Grounded Video Caption Generation | Code | 1
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding | Code | 1
STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding | - | 0
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | - | 0
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | - | 0
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding | - | 0
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding | - | 0
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding | - | 0
Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding | - | 0
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding | - | 0
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability | - | 0
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding | - | 0
STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding | - | 0
Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding | - | 0
Guided Attention for Interpretable Motion Captioning | Code | 0
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | TA-STVG | Val m_vIoU | 40.2 | - | Unverified
2 | CG-STVG | Val m_vIoU | 39.5 | - | Unverified
3 | STVGFormer | Val m_vIoU | 38.7 | - | Unverified
4 | TubeDETR | Val m_vIoU | 36.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | TA-STVG | m_vIoU | 39.1 | - | Unverified
2 | CG-STVG | m_vIoU | 38.4 | - | Unverified
3 | TubeDETR | m_vIoU | 32.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | TA-STVG | Declarative m_vIoU | 34.4 | - | Unverified
2 | CG-STVG | Declarative m_vIoU | 34 | - | Unverified
3 | TubeDETR | Declarative m_vIoU | 30.4 | - | Unverified
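
The benchmark tables report m_vIoU without defining it. As commonly used in the spatio-temporal video grounding literature, vIoU for one sample sums the per-frame box IoU over the frames shared by the predicted and ground-truth tubes, then divides by the number of frames in the union of the two tubes; m_vIoU is the mean over all samples. The following is a minimal Python sketch of that definition, not the evaluation code of any listed paper. The names `Box`, `Tube`, `box_iou`, `viou`, and `mean_viou` are hypothetical, as is the assumption that a tube is stored as a mapping from frame index to an (x1, y1, x2, y2) box. The function returns a fraction in [0, 1]; the scores in the tables appear to be the same quantity expressed as a percentage.

```python
from typing import Dict, List, Tuple

# Assumed representation (not taken from any listed paper):
# a box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def viou(pred: Tube, gt: Tube) -> float:
    """vIoU: per-frame IoU summed over shared frames, divided by the
    number of frames covered by either tube (temporal union)."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def mean_viou(pairs: List[Tuple[Tube, Tube]]) -> float:
    """m_vIoU: average vIoU over (prediction, ground truth) pairs."""
    if not pairs:
        return 0.0
    return sum(viou(p, g) for p, g in pairs) / len(pairs)
```

Because the denominator counts every frame in either tube, a prediction with perfect boxes is still penalized when its temporal extent is too long or too short, so the metric scores temporal and spatial localization together.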