Spatio-Temporal Video Grounding

Spatio-temporal video grounding (STVG) is a task at the intersection of computer vision and natural language processing (NLP). Given a video and a textual query, the goal is to localize the spatio-temporal region the query describes: both when it occurs in the video and where it appears in each frame. The task supports applications such as video summarization, content-based video retrieval, and video captioning.
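To make the task's output concrete, here is a minimal sketch of how a grounding result might be represented. It is not the format of any paper listed on this page: the names `GroundingResult`, `start_frame`, `end_frame`, `boxes`, and `is_consistent` are hypothetical, and the (x1, y1, x2, y2) pixel box convention is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box convention: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundingResult:
    """Hypothetical container for one grounded query."""
    query: str              # the textual description being grounded
    start_frame: int        # first frame of the temporal segment
    end_frame: int          # last frame of the segment (inclusive)
    boxes: Dict[int, Box]   # frame index -> box; together they form a "tube"

    def is_consistent(self) -> bool:
        # The tube should contain exactly one box per frame of the segment.
        if self.start_frame > self.end_frame:
            return False
        expected = set(range(self.start_frame, self.end_frame + 1))
        return set(self.boxes) == expected


result = GroundingResult(
    query="the person in red picks up a cup",
    start_frame=10,
    end_frame=12,
    boxes={
        10: (50.0, 40.0, 120.0, 200.0),
        11: (52.0, 41.0, 122.0, 201.0),
        12: (55.0, 42.0, 125.0, 202.0),
    },
)
print(result.is_consistent())
```

The temporal segment answers "when" and the per-frame boxes answer "where"; a model for this task would have to predict both from the video and the query.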

Papers

Title	Date	Tasks	Status	Hype
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	Nov 22, 2023	Benchmarking, Phrase Grounding	Code Available	2
Context-Guided Spatio-Temporal Video Grounding	Jan 3, 2024	Object, Spatio-Temporal Video Grounding	Code Available	2
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences	Jan 19, 2020	Form, Object	Code Available	1
Human-centric Spatio-Temporal Video Grounding With Visual Transformers	Nov 10, 2020	Referring Expression, Sentence	Code Available	1
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding	Feb 16, 2025	Attribute, Object	Code Available	1
TubeDETR: Spatio-Temporal Video Grounding with Transformers	Mar 30, 2022	Decoder, Language-Based Temporal Localization	Code Available	1
Large-scale Pre-training for Grounded Video Caption Generation	Mar 13, 2025	Caption Generation	Code Available	1
Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding	Sep 27, 2022	Decoder, Spatio-Temporal Video Grounding	Code Available	1
STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding	Jul 6, 2022	Spatio-Temporal Video Grounding, Video Grounding	Unverified	0
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Dec 31, 2023	Spatio-Temporal Video Grounding, Video Grounding	Unverified	0
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding	Jan 1, 2024	Spatio-Temporal Video Grounding, Video Grounding	Unverified	0
WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding	Jan 1, 2023	Contrastive Learning, Spatio-Temporal Video Grounding	Unverified	0
Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding	Jan 1, 2023	Object, Spatio-Temporal Video Grounding	Unverified	0
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding	Jan 28, 2025	Object Detection	Unverified	0
Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding	Jul 2, 2022	Spatio-Temporal Video Grounding, Video Grounding	Unverified	0
Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding	Aug 16, 2020	Diversity, Object	Unverified	0
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability	Mar 18, 2025	Language Modeling	Unverified	0
STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding	Jan 1, 2025	Action Understanding, Spatio-Temporal Video Grounding	Unverified	0
STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding	Jan 1, 2021	Object, Sentence	Unverified	0
Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding	Jun 20, 2021	Spatio-Temporal Video Grounding, Video Grounding	Unverified	0
Guided Attention for Interpretable Motion Captioning	Oct 11, 2023	Action Localization, Motion Captioning	Code Available	0
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions	Mar 29, 2023	Representation Learning, Spatio-Temporal Video Grounding	Code Available	0

Datasets: HC-STVG2, HC-STVG1, VidSTG

Benchmark Results

Dataset: HC-STVG2
#	Model	Metric	Claimed	Verified	Status
1	TA-STVG	Val m_vIoU	40.2	—	Unverified
2	CG-STVG	Val m_vIoU	39.5	—	Unverified
3	STVGFormer	Val m_vIoU	38.7	—	Unverified
4	TubeDETR	Val m_vIoU	36.4	—	Unverified

Dataset: HC-STVG1
#	Model	Metric	Claimed	Verified	Status
1	TA-STVG	m_vIoU	39.1	—	Unverified
2	CG-STVG	m_vIoU	38.4	—	Unverified
3	TubeDETR	m_vIoU	32.4	—	Unverified

Dataset: VidSTG
#	Model	Metric	Claimed	Verified	Status
1	TA-STVG	Declarative m_vIoU	34.4	—	Unverified
2	CG-STVG	Declarative m_vIoU	34.0	—	Unverified
3	TubeDETR	Declarative m_vIoU	30.4	—	Unverified
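The leaderboards above rank models by m_vIoU, which this page does not define. The sketch below assumes the definition commonly used in the spatio-temporal video grounding literature: for each sample, per-frame box IoU is summed over the frames shared by the predicted and ground-truth segments, divided by the number of frames in the union of the two segments, and the result is averaged over samples. The function names and the (x1, y1, x2, y2) box convention are illustrative choices, not any benchmark's official evaluation code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)
Tube = Dict[int, Box]                    # frame index -> box


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """vIoU for one sample: spatial IoU summed over shared frames,
    normalized by the size of the temporal union."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[t], gt[t]) for t in shared_frames)
    return total / len(union_frames)


def mean_v_iou(pairs: List[Tuple[Tube, Tube]]) -> float:
    """m_vIoU: average vIoU over (prediction, ground truth) pairs."""
    if not pairs:
        return 0.0
    return sum(v_iou(p, g) for p, g in pairs) / len(pairs)


# Toy example: perfect boxes, but the predicted segment is shifted by one frame.
gt = {0: (0.0, 0.0, 10.0, 10.0), 1: (0.0, 0.0, 10.0, 10.0)}
pred = {1: (0.0, 0.0, 10.0, 10.0), 2: (0.0, 0.0, 10.0, 10.0)}
print(v_iou(pred, gt))  # one shared frame out of three in the union
```

Under this definition a prediction is penalized for both temporal and spatial error, so a model with perfect boxes but a misaligned segment still scores low. The scores in the tables appear to be on a 0 to 100 scale, which would correspond to multiplying this mean by 100.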