SOTAVerified

Dense Video Captioning

Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. The task of dense video captioning involves both detecting the events in a video and describing each of them.
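
As a rough illustration of what "detecting and describing" can produce, the sketch below represents each event as a time span plus a caption. The `Event` class, the timestamps, and the `temporal_iou` helper are all hypothetical and chosen only for this example; temporal intersection-over-union is a common way to compare two time spans, but nothing here is taken from a specific paper or benchmark listed on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One detected event: a time span in seconds plus its caption."""
    start: float
    end: float
    caption: str


def temporal_iou(a: Event, b: Event) -> float:
    """Overlap of two time spans divided by their union (0.0 if disjoint)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical output for the piano example; events may overlap in time.
events = [
    Event(0.0, 60.0, "a man plays a piano"),
    Event(10.0, 40.0, "another man dances"),
    Event(55.0, 60.0, "a crowd claps"),
]

for event in events:
    print(f"[{event.start:5.1f}s - {event.end:5.1f}s] {event.caption}")
```

Unlike single-sentence video captioning, the output is a set of possibly overlapping segments, which is why the events are stored with explicit start and end times.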

Papers

Showing 51–76 of 76 papers

| Title | Status | Hype |
| --- | --- | --- |
| Event and Entity Extraction from Generated Video Captions | Code | 0 |
| A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos | | 0 |
| Recipe Generation from Unsegmented Cooking Videos | | 0 |
| SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions | | 0 |
| PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning | | 0 |
| End-to-end Dense Video Captioning as Sequence Generation | | 0 |
| Semantic-Aware Pretraining for Dense Video Captioning | | 0 |
| End-to-end Dense Video Captioning as Sequence Generation | | 0 |
| Dense Video Captioning Using Unsupervised Semantic Information | Code | 0 |
| DVCFlow: Modeling Information Flow Towards Human-like Video Captioning | | 0 |
| Global Object Proposals for Improving Multi-Sentence Video Descriptions | Code | 0 |
| Sketch, Ground, and Refine: Top-Down Dense Video Captioning | Code | 0 |
| Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching | | 0 |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | | 0 |
| SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning | | 0 |
| Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning | | 0 |
| Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos | | 0 |
| Attention is all you need for Videos: Self-attention based Video Summarization using Universal Transformers | | 0 |
| Streamlined Dense Video Captioning | Code | 0 |
| RUC+CMU: System Report for Dense Captioning Events in Videos | | 0 |
| Jointly Localizing and Describing Events for Dense Video Captioning | | 0 |
| End-to-End Dense Video Captioning with Masked Transformer | Code | 0 |
| Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | Code | 0 |
| Joint Event Detection and Description in Continuous Video Streams | Code | 0 |
| Weakly Supervised Dense Video Captioning | | 0 |
| Towards Automatic Learning of Procedures from Web Instructional Videos | Code | 0 |

Benchmark Results

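| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | VTimeLLM | CIDEr | 27.6 | | Unverified |
| 2 | Vid2Seq | METEOR | 17 | | Unverified |
| 3 | ADV-INF + Global | METEOR | 16.36 | | Unverified |
| 4 | Bi-directional+intra captioning | METEOR | 11.28 | | Unverified |
| 5 | GVL | METEOR | 10.03 | | Unverified |
| 6 | TSRM-CMG-HRNN+SCST | METEOR | 9.71 | | Unverified |
| 7 | PDVC (TSP features, no SCST) | METEOR | 9.03 | | Unverified |
| 8 | TSP | METEOR | 8.75 | | Unverified |
| 9 | CM² | METEOR | 8.55 | | Unverified |
| 10 | BMT | METEOR | 8.44 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | HiCM² | CIDEr | 71.84 | | Unverified |
| 2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | | Unverified |
| 3 | Vid2Seq | CIDEr | 47.1 | | Unverified |
| 4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | | Unverified |
| 5 | CM² | CIDEr | 31.66 | | Unverified |
| 6 | GVL | CIDEr | 26.52 | | Unverified |
| 7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | E2ESG | CIDEr | 25 | | Unverified |
| 2 | Vid2Seq (VidChapters-7M PT) | SODA | 0.15 | | Unverified |
| 3 | HiCM² | SODA | 0.15 | | Unverified |
| 4 | Vid2Seq | SODA | 0.14 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Vid2Seq | CIDEr | 55.7 | | Unverified |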