
Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to localize the specific moment in an untrimmed video that matches a given natural language query. Methods for this task rely on different levels of supervision (see the sketch below):

1) Weak supervision: only the video-level set of action categories.
2) Semi-weak supervision: the video-level action category set, plus action annotations at a few timestamps.
3) Full supervision: action category and action interval annotations for all actions in the untrimmed video.
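
For concreteness, here is a sketch of what these three annotation levels might look like for a single untrimmed video; all field names and values are hypothetical, not taken from any listed dataset:

```python
# Hypothetical annotation records for one untrimmed video; field
# names and values are illustrative, not from any listed dataset.

weak = {                      # 1) video-level category set only
    "video": "v001.mp4",
    "actions": ["open door", "sit down"],
}

semi_weak = {                 # 2) categories plus a few annotated timestamps
    "video": "v001.mp4",
    "glances": [("open door", 4.2), ("sit down", 31.0)],  # (label, time in s)
}

full = {                      # 3) category and interval for every action
    "video": "v001.mp4",
    "segments": [("open door", 3.5, 6.1), ("sit down", 29.8, 35.2)],  # (label, start, end)
}
```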

Papers

Showing 1–43 of 43 papers

| Title | Status | Hype |
|---|---|---|
| Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Code | 2 |
| Span-based Localizing Network for Natural Language Video Localization | Code | 1 |
| Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding | Code | 1 |
| BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos | Code | 1 |
| Uncovering Hidden Challenges in Query-Based Video Moment Retrieval | Code | 1 |
| Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding | Code | 1 |
| D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation | Code | 1 |
| Weakly Supervised Temporal Sentence Grounding With Gaussian-Based Contrastive Proposal Learning | Code | 1 |
| DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos | Code | 1 |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | | 0 |
| Hierarchical Local-Global Transformer for Temporal Sentence Grounding | | 0 |
| You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos | | 0 |
| Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition | | 0 |
| A Survey on Temporal Sentence Grounding in Videos | | 0 |
| Constraint and Union for Partially-Supervised Temporal Sentence Grounding | | 0 |
| Context-aware Biaffine Localizing Network for Temporal Sentence Grounding | | 0 |
| Contrast-Unity for Partially-Supervised Temporal Sentence Grounding | | 0 |
| Diversified Augmentation with Domain Adaptation for Debiased Video Temporal Grounding | | 0 |
| Exploring Motion and Appearance Information for Temporal Sentence Grounding | | 0 |
| Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding | | 0 |
| Towards Debiasing Temporal Sentence Grounding in Video | | 0 |
| Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video | | 0 |
| Tracking Objects and Activities with Attention for Temporal Sentence Grounding | | 0 |
| Transform-Equivariant Consistency Learning for Temporal Sentence Grounding | | 0 |
| Video sentence grounding with temporally global textual knowledge | | 0 |
| Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training | | 0 |
| Weakly Supervised Temporal Sentence Grounding via Positive Sample Mining | | 0 |
| A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach | | 0 |
| Learning to Focus on the Foreground for Temporal Sentence Grounding | | 0 |
| Memory-Guided Semantic Learning Network for Temporal Sentence Grounding | | 0 |
| Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network | | 0 |
| Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding | | 0 |
| Reducing the Vision and Language Bias for Temporal Sentence Grounding | | 0 |
| Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding | | 0 |
| Temporal Sentence Grounding in Videos: A Survey and Future Directions | | 0 |
| Learning Temporal Sentence Grounding From Narrated EgoVideos | Code | 0 |
| Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos | Code | 0 |
| Diversifying Query: Region-Guided Transformer for Temporal Sentence Grounding | Code | 0 |
| Temporal Sentence Grounding in Streaming Videos | Code | 0 |
| Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation | Code | 0 |
| Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video | Code | 0 |
| A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric | Code | 0 |
| Transformer with Controlled Attention for Synchronous Motion Captioning | Code | 0 |

Benchmark Results
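
The two result tables below (apparently from different benchmarks) report Recall@1 at a temporal IoU threshold, written R1@0.7 and R@1,IoU=0.3: the fraction of queries whose single top-ranked predicted moment overlaps the ground-truth interval with IoU at or above the threshold. A minimal sketch of this standard metric, with illustrative function names:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, threshold=0.7):
    """R1@threshold: percentage of queries whose top-1 moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, ground_truths))
    return 100.0 * hits / len(ground_truths)

# One query: prediction (12.0, 25.5) s vs. ground truth (10.0, 24.0) s -> IoU ~ 0.77
print(recall_at_1([(12.0, 25.5)], [(10.0, 24.0)], threshold=0.7))  # 100.0
```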

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | DeCafNet | R1@0.7 | 47.55 | | Unverified |
| 2 | AdaFocus (Full, MViT-Charades-Pretrain-feature, MMN model) | R1@0.7 | 38.6 | | Unverified |
| 3 | AdaFocus (Full, I3D-Charades-Pretrain-feature, MMN model) | R1@0.7 | 35.6 | | Unverified |
| 4 | MMN (Full, MViT-K400-Pretrain-feature, evaluated by AdaFocus) | R1@0.7 | 32.2 | | Unverified |
| 5 | MMN (Full, I3D-K400-Pretrain-feature, evaluated by AdaFocus) | R1@0.7 | 29.8 | | Unverified |
| 6 | AdaFocus (Weak, MViT-Charades-Pretrain-feature, CPL model) | R1@0.7 | 23.2 | | Unverified |
| 7 | AdaFocus (Weak, I3D-Charades-Pretrain-feature, CPL model) | R1@0.7 | 22.4 | | Unverified |
| 8 | CPL (Weak, MViT-K400-Pretrain-feature, evaluated by AdaFocus) | R1@0.7 | 21.8 | | Unverified |
| 9 | AdaFocus (Semi-weak, MViT-Charades-Pretrain-feature, D3G model) | R1@0.7 | 21.8 | | Unverified |
| 10 | AdaFocus (Semi-weak, I3D-Charades-Pretrain-feature, D3G model) | R1@0.7 | 21.1 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | DeCafNet-100% | R@1,IoU=0.3 | 23.2 | | Unverified |
| 2 | DeCafNet-50% | R@1,IoU=0.3 | 21.29 | | Unverified |
| 3 | VSLNet | R@1,IoU=0.3 | 11.7 | | Unverified |