SOTAVerified

Dense Video Captioning

Most natural videos contain numerous events. For example, a video of a “man playing a piano” might also contain “another man dancing” or “a crowd clapping”. Dense video captioning involves both detecting the events in a video, that is, localizing each one in time, and describing each detected event in natural language.
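As a minimal sketch of what the output of this task looks like, each event can be represented as a time segment paired with a caption. The `CaptionedEvent` class, the `temporal_iou` helper, and the timestamps below are illustrative assumptions, not taken from any listed paper. Temporal intersection-over-union is a common way to compare a predicted segment against a reference segment before captions are scored.

```python
from dataclasses import dataclass


@dataclass
class CaptionedEvent:
    """One detected event: a time span in seconds plus its description."""
    start: float
    end: float
    caption: str


def temporal_iou(a: CaptionedEvent, b: CaptionedEvent) -> float:
    """Overlap of two time segments divided by the length of their union."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical timestamps for the piano example; events may overlap in time.
events = [
    CaptionedEvent(0.0, 30.0, "a man playing a piano"),
    CaptionedEvent(10.0, 25.0, "another man dancing"),
    CaptionedEvent(28.0, 35.0, "a crowd clapping"),
]

for event in events:
    print(f"[{event.start:5.1f}s - {event.end:5.1f}s] {event.caption}")
```

Unlike single-sentence video captioning, the segments here can overlap, which is why a model has to produce both the boundaries and the text for every event.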

Papers

Showing 1–50 of 76 papers

Title | Status | Hype
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Code | 2
VidChapters-7M: Video Chapters at Scale | Code | 2
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos | Code | 2
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Code | 2
VTimeLLM: Empower LLM to Grasp Video Moments | Code | 2
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Code | 2
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Code | 2
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2
SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts Commentaries | Code | 2
OmniVid: A Generative Framework for Universal Video Understanding | Code | 2
SODA: Story Oriented Dense Video Captioning Evaluation Framework | Code | 1
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training | Code | 1
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | Code | 1
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks | Code | 1
SoccerNet 2023 Challenges Results | Code | 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Code | 1
Multimodal Pretraining for Dense Video Captioning | Code | 1
End-to-End Dense Video Captioning with Parallel Decoding | Code | 1
Multi-modal Dense Video Captioning | Code | 1
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos | Code | 1
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer | Code | 1
HiCM²: Hierarchical Compact Memory Modeling for Dense Video Captioning | Code | 1
Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020 | Code | 1
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Code | 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Code | 1
Towards Automatic Learning of Procedures from Web Instructional Videos | Code | 0
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | Code | 0
Dense Video Captioning Using Unsupervised Semantic Information | Code | 0
End-to-End Dense Video Captioning with Masked Transformer | Code | 0
Global Object Proposals for Improving Multi-Sentence Video Descriptions | Code | 0
Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning | Code | 0
Joint Event Detection and Description in Continuous Video Streams | Code | 0
Live Video Captioning | Code | 0
Event and Entity Extraction from Generated Video Captions | Code | 0
Sketch, Ground, and Refine: Top-Down Dense Video Captioning | Code | 0
SoccerNet 2024 Challenges Results | Code | 0
Streaming Dense Video Captioning | Code | 0
Streamlined Dense Video Captioning | Code | 0
Visual Transformation Telling | Code | 0
SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning | - | 0
SAVCHOI: Detecting Suspicious Activities using Dense Video Captioning with Human Object Interactions | - | 0
Semantic-Aware Pretraining for Dense Video Captioning | - | 0
A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos | - | 0
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding | - | 0
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization | - | 0
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | - | 0
Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning | - | 0
Weakly Supervised Dense Video Captioning | - | 0
Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | VTimeLLM | CIDEr | 27.6 | - | Unverified
2 | Vid2Seq | METEOR | 17 | - | Unverified
3 | ADV-INF + Global | METEOR | 16.36 | - | Unverified
4 | Bi-directional+intra captioning | METEOR | 11.28 | - | Unverified
5 | GVL | METEOR | 10.03 | - | Unverified
6 | TSRM-CMG-HRNN+SCST | METEOR | 9.71 | - | Unverified
7 | PDVC (TSP features, no SCST) | METEOR | 9.03 | - | Unverified
8 | TSP | METEOR | 8.75 | - | Unverified
9 | CM² | METEOR | 8.55 | - | Unverified
10 | BMT | METEOR | 8.44 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | HiCM² | CIDEr | 71.84 | - | Unverified
2 | Vid2Seq (HowTo100M+VidChapters-7M PT) | CIDEr | 67.2 | - | Unverified
3 | Vid2Seq | CIDEr | 47.1 | - | Unverified
4 | E2vidD6-MASSalign-BiD | ROUGE-L | 39.03 | - | Unverified
5 | CM² | CIDEr | 31.66 | - | Unverified
6 | GVL | CIDEr | 26.52 | - | Unverified
7 | PDVC (TSN features, no SCST) | CIDEr | 22.71 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | E2ESG | CIDEr | 25 | - | Unverified
2 | Vid2Seq (VidChapters-7M PT) | SODA | 0.15 | - | Unverified
3 | HiCM² | SODA | 0.15 | - | Unverified
4 | Vid2Seq | SODA | 0.14 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 55.7 | - | Unverified