SOTAVerified

Dense Video Captioning

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting and describing events in a video.
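As a minimal sketch of what "detecting and describing" produces, the output can be modeled as a list of time-stamped captions whose spans may overlap. The `Event` class, the `temporal_iou` helper, and the timestamps below are illustrative assumptions, not something defined on this page. Temporal IoU is shown only because it is a common way to compare a predicted span with a reference span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One detected event: a time span in seconds plus its caption (hypothetical type)."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Overlap of two time spans divided by their union (0.0 if they are disjoint)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up timestamps for the piano video example; events are allowed to overlap.
events = [
    Event(0.0, 30.0, "a man playing a piano"),
    Event(10.0, 20.0, "another man dancing"),
    Event(25.0, 30.0, "a crowd clapping"),
]

for event in events:
    print(f"[{event.start:5.1f}s - {event.end:5.1f}s] {event.caption}")
```

Unlike single-sentence video captioning, each caption here is tied to its own span, so a system has to get both the localization and the description right.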

Papers

Showing 51–75 of 76 papers

| Title | Status | Hype |
| --- | --- | --- |
| Recipe Generation from Unsegmented Cooking Videos | | 0 |
| RUC+CMU: System Report for Dense Captioning Events in Videos | | 0 |
| SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning | | 0 |
| SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions | | 0 |
| Semantic-Aware Pretraining for Dense Video Captioning | | 0 |
| Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding | | 0 |
| Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization | | 0 |
| Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | | 0 |
| Video LLMs for Temporal Reasoning in Long Videos | | 0 |
| Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | | 0 |
| Weakly Supervised Dense Video Captioning | | 0 |
| Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching | | 0 |
| Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning | | 0 |
| Live Video Captioning | Code | 0 |
| Joint Event Detection and Description in Continuous Video Streams | Code | 0 |
| Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning | Code | 0 |
| Global Object Proposals for Improving Multi-Sentence Video Descriptions | Code | 0 |
| Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | Code | 0 |
| Visual Transformation Telling | Code | 0 |
| End-to-End Dense Video Captioning with Masked Transformer | Code | 0 |
| Towards Automatic Learning of Procedures from Web Instructional Videos | Code | 0 |
| Streamlined Dense Video Captioning | Code | 0 |
| SoccerNet 2024 Challenges Results | Code | 0 |
| Dense Video Captioning Using Unsupervised Semantic Information | Code | 0 |
| Sketch, Ground, and Refine: Top-Down Dense Video Captioning | Code | 0 |

Benchmark Results

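| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VTimeLLM | CIDEr | 27.6 | | Unverified |
| 2 | Vid2Seq | METEOR | 17 | | Unverified |
| 3 | ADV-INF + Global | METEOR | 16.36 | | Unverified |
| 4 | Bi-directional+intra captioning | METEOR | 11.28 | | Unverified |
| 5 | GVL | METEOR | 10.03 | | Unverified |
| 6 | TSRM-CMG-HRNN+SCST | METEOR | 9.71 | | Unverified |
| 7 | PDVC (TSP features, no SCST) | METEOR | 9.03 | | Unverified |
| 8 | TSP | METEOR | 8.75 | | Unverified |
| 9 | CM² | METEOR | 8.55 | | Unverified |
| 10 | BMT | METEOR | 8.44 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | HiCM² | CIDEr | 71.84 | | Unverified |
| 2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | | Unverified |
| 3 | Vid2Seq | CIDEr | 47.1 | | Unverified |
| 4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | | Unverified |
| 5 | CM² | CIDEr | 31.66 | | Unverified |
| 6 | GVL | CIDEr | 26.52 | | Unverified |
| 7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | E2ESG | CIDEr | 25 | | Unverified |
| 2 | Vid2Seq (VidChapters-7M PT) | SODA | 0.15 | | Unverified |
| 3 | HiCM² | SODA | 0.15 | | Unverified |
| 4 | Vid2Seq | SODA | 0.14 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Vid2Seq | CIDEr | 55.7 | | Unverified |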