Spatio-Temporal Video Grounding

Spatio-temporal video grounding is a task at the intersection of computer vision and natural language processing (NLP). Given a video and a textual query or description, it aims to localize the spatio-temporal region the text refers to: where in the frame and when in the video the described object or event appears. The task supports applications such as video summarization, content-based video retrieval, and video captioning.

Papers


Showing 11–20 of 22 papers

Title	Date	Tasks	Status
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Jan 1, 2024	Spatio-Temporal Video Grounding, Video Grounding	Unverified
Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding	Jun 20, 2021	Spatio-Temporal Video Grounding, Video Grounding	Unverified
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding	Jan 1, 2023	Contrastive Learning, Spatio-Temporal Video Grounding	Unverified
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding	Jan 1, 2023	Object, Spatio-Temporal Video Grounding	Unverified
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding	Jan 28, 2025	Object Detection	Unverified
Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding	Jul 2, 2022	Spatio-Temporal Video Grounding, Video Grounding	Unverified
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding	Aug 16, 2020	Diversity, Object	Unverified
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability	Mar 18, 2025	Language Modeling	Unverified
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding	Jan 1, 2025	Action Understanding, Spatio-Temporal Video Grounding	Unverified
STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding	Jan 1, 2021	Object, Sentence	Unverified


Benchmark Results

Datasets: HC-STVG2, HC-STVG1, VidSTG

HC-STVG2

#	Model	Metric	Claimed	Verified	Status
1	TA-STVG	Val m_vIoU	40.2	—	Unverified
2	CG-STVG	Val m_vIoU	39.5	—	Unverified
3	STVGFormer	Val m_vIoU	38.7	—	Unverified
4	TubeDETR	Val m_vIoU	36.4	—	Unverified

HC-STVG1

#	Model	Metric	Claimed	Verified	Status
1	TA-STVG	m_vIoU	39.1	—	Unverified
2	CG-STVG	m_vIoU	38.4	—	Unverified
3	TubeDETR	m_vIoU	32.4	—	Unverified

VidSTG

#	Model	Metric	Claimed	Verified	Status
1	TA-STVG	Declarative m_vIoU	34.4	—	Unverified
2	CG-STVG	Declarative m_vIoU	34.0	—	Unverified
3	TubeDETR	Declarative m_vIoU	30.4	—	Unverified
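
The m_vIoU metric reported in the tables above is not defined on this page. The following is a minimal sketch of how it is commonly computed in the STVG literature, not the evaluation code of any listed model. It assumes a tube is stored as a mapping from frame index to one (x1, y1, x2, y2) box. The vIoU of one sample is the sum of per-frame box IoUs over frames present in both the predicted and ground-truth tubes, divided by the number of frames present in either; m_vIoU is the mean over samples. The names Box, Tube, box_iou, v_iou, and m_v_iou are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Assumed representation: a box is (x1, y1, x2, y2); a tube maps frame index -> box.
Box = Tuple[float, float, float, float]
Tube = Dict[int, Box]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Sum of box IoUs over shared frames, divided by the count of frames in either tube."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def m_v_iou(preds: List[Tube], gts: List[Tube]) -> float:
    """Mean vIoU over paired predicted and ground-truth tubes."""
    scores = [v_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: ground truth covers frames 0-3, prediction covers frames 2-5.
gt = {t: (0.0, 0.0, 10.0, 10.0) for t in range(0, 4)}
pred = {t: (0.0, 0.0, 10.0, 5.0) for t in range(2, 6)}
score = v_iou(pred, gt)  # two shared frames at IoU 0.5 each, six frames in the union
```

In the toy example the prediction overlaps the ground truth on only two of six frames, and covers half the box area on those frames, so the score is penalized for both temporal and spatial error. Leaderboards usually report the value as a percentage, that is, multiplied by 100.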