SOTAVerified

Dense Video Captioning

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
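As an illustrative sketch only, the output of a dense video captioning system can be thought of as a list of timed events, each pairing a temporal segment with a sentence. The `Event` class, the `temporal_iou` helper, and the timestamps below are hypothetical and are not taken from any listed paper. Temporal intersection-over-union is shown as one common way to compare a detected segment with a reference segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One detected event: a time span in seconds plus its description."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Overlap of two time spans divided by the length of their union."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predictions for the piano example; events may overlap in time.
predictions = [
    Event(0.0, 30.0, "a man plays a piano"),
    Event(12.0, 20.0, "another man dances"),
    Event(25.0, 30.0, "a crowd claps"),
]

# Hypothetical reference annotation to match against.
reference = Event(10.0, 20.0, "a second man starts dancing")

# Pick the predicted event whose segment best overlaps the reference.
best = max(predictions, key=lambda e: temporal_iou(e, reference))
print(best.caption, round(temporal_iou(best, reference), 2))
```

The captioning metrics in the tables below (METEOR, CIDEr, ROUGE-L, SODA) score the generated text; a segment-matching step like this one only decides which predicted and reference events are compared.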

Papers

Showing 51–76 of 76 papers

Title | Status | Hype
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding | - | 0
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization | - | 0
Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning | - | 0
Technical Report for Soccernet 2023 -- Dense Video Captioning | - | 0
The 8th AI City Challenge | - | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | - | 0
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | - | 0
Video LLMs for Temporal Reasoning in Long Videos | - | 0
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | - | 0
Weakly Supervised Dense Video Captioning | - | 0
Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching | - | 0
Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning | - | 0
Event and Entity Extraction from Generated Video Captions | Code | 0
Live Video Captioning | Code | 0
Joint Event Detection and Description in Continuous Video Streams | Code | 0
Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning | Code | 0
Streamlined Dense Video Captioning | Code | 0
Global Object Proposals for Improving Multi-Sentence Video Descriptions | Code | 0
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | Code | 0
Visual Transformation Telling | Code | 0
End-to-End Dense Video Captioning with Masked Transformer | Code | 0
Towards Automatic Learning of Procedures from Web Instructional Videos | Code | 0
Streaming Dense Video Captioning | Code | 0
SoccerNet 2024 Challenges Results | Code | 0
Dense Video Captioning Using Unsupervised Semantic Information | Code | 0
Sketch, Ground, and Refine: Top-Down Dense Video Captioning | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VTimeLLM | CIDEr | 27.6 | - | Unverified
2 | Vid2Seq | METEOR | 17 | - | Unverified
3 | ADV-INF + Global | METEOR | 16.36 | - | Unverified
4 | Bi-directional+intra captioning | METEOR | 11.28 | - | Unverified
5 | GVL | METEOR | 10.03 | - | Unverified
6 | TSRM-CMG-HRNN+SCST | METEOR | 9.71 | - | Unverified
7 | PDVC (TSP features, no SCST) | METEOR | 9.03 | - | Unverified
8 | TSP | METEOR | 8.75 | - | Unverified
9 | CM² | METEOR | 8.55 | - | Unverified
10 | BMT | METEOR | 8.44 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HiCM² | CIDEr | 71.84 | - | Unverified
2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | - | Unverified
3 | Vid2Seq | CIDEr | 47.1 | - | Unverified
4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | - | Unverified
5 | CM² | CIDEr | 31.66 | - | Unverified
6 | GVL | CIDEr | 26.52 | - | Unverified
7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | E2ESG | CIDEr | 25 | - | Unverified
2 | Vid2Seq (VidChapters-7M PT) | SODA | 0.15 | - | Unverified
3 | HiCM² | SODA | 0.15 | - | Unverified
4 | Vid2Seq | SODA | 0.14 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 55.7 | - | Unverified