SOTAVerified

Video Grounding

Video grounding is the task of linking natural language descriptions to specific video segments. Given a video and a natural language description, such as a sentence or a caption, the model must identify the segment of the video that corresponds to the description. This can involve localizing the objects or actions mentioned in the description within the video (spatial grounding), or associating a specific time interval with the description (temporal grounding).
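As a toy illustration of the temporal side of the task, the sketch below ranks candidate time intervals by word overlap between the query and per-segment captions. Real systems learn a cross-modal similarity from video features; the function and data here are hypothetical, not from any listed paper.

```python
def ground_by_overlap(query, segments):
    """Toy grounding: return the (start, end) interval whose caption shares
    the most words with the query.

    segments: list of (start_sec, end_sec, caption) tuples. Word overlap is
    a hypothetical stand-in for a learned cross-modal similarity score.
    """
    query_words = set(query.lower().split())

    def score(segment):
        _, _, caption = segment
        return len(query_words & set(caption.lower().split()))

    start, end, _ = max(segments, key=score)
    return start, end


segments = [
    (0.0, 5.0, "a dog runs in the park"),
    (5.0, 12.0, "a man opens the door"),
]
print(ground_by_overlap("the man opening a door", segments))  # (5.0, 12.0)
```

A real grounding model would replace the word-overlap score with a similarity computed between query embeddings and per-segment video features, but the interface — query in, time interval out — is the same.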

Papers

Showing 81–90 of 114 papers (page 9 of 12)

| Title | Status | Hype |
|---|---|---|
| STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding | — | 0 |
| Gaussian Kernel-based Cross Modal Network for Spatio-Temporal Video Grounding | — | 0 |
| Animal Kingdom: A Large and Diverse Dataset for Animal Behavior Understanding | Code | 1 |
| Position-aware Location Regression Network for Temporal Video Grounding | — | 0 |
| TubeDETR: Spatio-Temporal Video Grounding with Transformers | Code | 1 |
| UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection | Code | 2 |
| End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding | — | 0 |
| Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding | — | 0 |
| Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos | Code | 1 |
| Unsupervised Temporal Video Grounding with Deep Semantic Clustering | — | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | InternVideo2-6B | R@1, IoU=0.7 | 56.45 | — | Unverified |
| 2 | InternVideo2-1B | R@1, IoU=0.7 | 54.45 | — | Unverified |
| 3 | LLMEPET | R@1, IoU=0.7 | 49.94 | — | Unverified |
| 4 | QD-DETR | R@1, IoU=0.7 | 44.98 | — | Unverified |
| 5 | DiffusionVMR | R@1, IoU=0.7 | 44.49 | — | Unverified |
| 6 | UMT | R@1, IoU=0.7 | 41.18 | — | Unverified |
| 7 | Moment-DETR | R@1, IoU=0.7 | 33.02 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | DeCafNet | R@1, IoU=0.1 | 13.25 | — | Unverified |
| 2 | DenoiseLoc | R@1, IoU=0.1 | 11.59 | — | Unverified |
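The metric in these tables, R@1 at a given IoU threshold, counts a query as correct when the model's top-ranked segment overlaps the ground-truth interval with temporal IoU at or above the threshold. A minimal sketch of how such a number is computed (function names are illustrative, not from any listed codebase):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_thresh=0.7):
    """Percentage of queries whose top-1 predicted interval reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(2.0, 10.0), (5.0, 20.0)]
print(recall_at_1(preds, gts))  # 50.0 — first IoU is 0.8 (hit), second is 0.25 (miss)
```

A stricter threshold (0.7 in the first table) demands tight localization, while a loose one (0.1 in the second) rewards merely touching the ground-truth interval, which is why the two tables' score ranges are not comparable.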