SOTAVerified

Dense Video Captioning

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.

Papers

Showing 51-76 of 76 papers

| Title | Status | Hype |
| --- | --- | --- |
| Event-Equalized Dense Video Captioning | | 0 |
| Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching | | 0 |
| Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning | | 0 |
| Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning | | 0 |
| Technical Report for Soccernet 2023 -- Dense Video Captioning | | 0 |
| The 8th AI City Challenge | | 0 |
| TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | | 0 |
| End-to-end Dense Video Captioning as Sequence Generation | | 0 |
| End-to-end Dense Video Captioning as Sequence Generation | | 0 |
| DVCFlow: Modeling Information Flow Towards Human-like Video Captioning | | 0 |
| Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | | 0 |
| DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement | | 0 |
| Dense Video Captioning using Graph-based Sentence Summarization | | 0 |
| Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols | | 0 |
| Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | | 0 |
| Attention is all you need for Videos: Self-attention based Video Summarization using Universal Transformers | | 0 |
| A Review of Deep Learning for Video Captioning | | 0 |
| Video LLMs for Temporal Reasoning in Long Videos | | 0 |
| Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment | | 0 |
| Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos | | 0 |
| Jointly Localizing and Describing Events for Dense Video Captioning | | 0 |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | | 0 |
| Exploiting Auxiliary Caption for Video Grounding | | 0 |
| PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning | | 0 |
| Recipe Generation from Unsegmented Cooking Videos | | 0 |
| RUC+CMU: System Report for Dense Captioning Events in Videos | | 0 |

Benchmark Results

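| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VTimeLLM | CIDEr | 27.6 | | Unverified |
| 2 | Vid2Seq | METEOR | 17 | | Unverified |
| 3 | ADV-INF + Global | METEOR | 16.36 | | Unverified |
| 4 | Bi-directional+intra captioning | METEOR | 11.28 | | Unverified |
| 5 | GVL | METEOR | 10.03 | | Unverified |
| 6 | TSRM-CMG-HRNN+SCST | METEOR | 9.71 | | Unverified |
| 7 | PDVC (TSP features, no SCST) | METEOR | 9.03 | | Unverified |
| 8 | TSP | METEOR | 8.75 | | Unverified |
| 9 | CM² | METEOR | 8.55 | | Unverified |
| 10 | BMT | METEOR | 8.44 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | HiCM² | CIDEr | 71.84 | | Unverified |
| 2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | | Unverified |
| 3 | Vid2Seq | CIDEr | 47.1 | | Unverified |
| 4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | | Unverified |
| 5 | CM² | CIDEr | 31.66 | | Unverified |
| 6 | GVL | CIDEr | 26.52 | | Unverified |
| 7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | E2ESG | CIDEr | 25 | | Unverified |
| 2 | Vid2Seq (VidChapters-7M PT) | SODA | 0.15 | | Unverified |
| 3 | HiCM² | SODA | 0.15 | | Unverified |
| 4 | Vid2Seq | SODA | 0.14 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Vid2Seq | CIDEr | 55.7 | | Unverified |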