SOTAVerified

Video Captioning

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events in it, which can help retrieve the video efficiently through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Papers

Showing 1-50 of 473 papers

Title | Status | Hype
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | Code | 11
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | Code | 5
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | Code | 5
Tarsier: Recipes for Training and Evaluating Large Video Description Models | Code | 4
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | Code | 4
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | Code | 4
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Code | 3
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | Code | 3
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | Code | 3
GiT: Towards Generalist Vision Transformer through Universal Language Interface | Code | 3
Video ReCap: Recursive Captioning of Hour-Long Videos | Code | 3
CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning | Code | 3
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Code | 3
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Code | 2
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting | Code | 2
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions | Code | 2
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | Code | 2
SkyScript-100M: 1,000,000,000 Pairs of Scripts and Shooting Scripts for Short Drama | Code | 2
Vript: A Video Is Worth Thousands of Words | Code | 2
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding | Code | 2
Movie101v2: Improved Movie Narration Benchmark | Code | 2
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Code | 2
Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval | Code | 2
OmniVid: A Generative Framework for Universal Video Understanding | Code | 2
VTimeLLM: Empower LLM to Grasp Video Moments | Code | 2
VidChapters-7M: Video Chapters at Scale | Code | 2
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | Code | 2
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | Code | 2
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | Code | 2
SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts Commentaries | Code | 2
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Code | 2
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Code | 2
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | Code | 2
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs | Code | 2
GIT: A Generative Image-to-text Transformer for Vision and Language | Code | 2
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks | Code | 1
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation | Code | 1
VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Code | 1
HiCM^2: Hierarchical Compact Memory Modeling for Dense Video Captioning | Code | 1
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Code | 1
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format | Code | 1
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning | Code | 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Code | 1
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction | Code | 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Code | 1
LVCHAT: Facilitating Long Video Comprehension | Code | 1
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | mPLUG-2 | CIDEr | 80 | | Unverified
2 | VAST | CIDEr | 78 | | Unverified
3 | GIT2 | CIDEr | 75.9 | | Unverified
4 | VLAB | CIDEr | 74.9 | | Unverified
5 | COSA | CIDEr | 74.7 | | Unverified
6 | VALOR | CIDEr | 74 | | Unverified
7 | MaMMUT (ours) | CIDEr | 73.6 | | Unverified
8 | VideoCoCa | CIDEr | 73.2 | | Unverified
9 | RTQ | CIDEr | 69.3 | | Unverified
10 | HowToCaption | CIDEr | 65.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MaMMUT | CIDEr | 195.6 | | Unverified
2 | VLAB | CIDEr | 179.8 | | Unverified
3 | COSA | CIDEr | 178.5 | | Unverified
4 | VALOR | CIDEr | 178.5 | | Unverified
5 | mPLUG-2 | CIDEr | 165.8 | | Unverified
6 | HowToCaption | CIDEr | 154.2 | | Unverified
7 | HiTeA | CIDEr | 146.9 | | Unverified
8 | Vid2Seq | CIDEr | 146.2 | | Unverified
9 | VIOLETv2 | CIDEr | 139.2 | | Unverified
10 | RTQ | CIDEr | 123.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 18.2 | | Unverified
2 | UniVL + MELTR | BLEU-4 | 17.92 | | Unverified
3 | UniVL | BLEU-4 | 17.35 | | Unverified
4 | VideoCoCa | BLEU-4 | 14.2 | | Unverified
5 | VLM | BLEU-4 | 12.27 | | Unverified
6 | E2vidD6-MASSvid-BiD | BLEU-4 | 12.04 | | Unverified
7 | TextKG | BLEU-4 | 11.7 | | Unverified
8 | COOT | BLEU-4 | 11.3 | | Unverified
9 | COSA | BLEU-4 | 10.1 | | Unverified
10 | HowToCaption | BLEU-4 | 8.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VALOR | BLEU-4 | 45.6 | | Unverified
2 | VAST | BLEU-4 | 45 | | Unverified
3 | COSA | BLEU-4 | 43.7 | | Unverified
4 | VideoCoCa | BLEU-4 | 39.7 | | Unverified
5 | IcoCap (ViT-B/16) | BLEU-4 | 37.4 | | Unverified
6 | IcoCap (ViT-B/32) | BLEU-4 | 36.9 | | Unverified
7 | VASTA (Kinetics-backbone) | BLEU-4 | 36.25 | | Unverified
8 | CoCap (ViT/L14) | BLEU-4 | 35.8 | | Unverified
9 | ORG-TRL | BLEU-4 | 32.1 | | Unverified
10 | NITS-VC | BLEU-4 | 20 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VideoCoCa | BLEU4 | 14.7 | | Unverified
2 | VLTinT (ae-test split) C3D/Ling | BLEU4 | 14.5 | | Unverified
3 | VLCap (ae-test split) - Appearance + Language | BLEU4 | 13.38 | | Unverified
4 | COOT (ae-test split) - Only Appearance features | BLEU4 | 10.85 | | Unverified
5 | MART (ae-test split) - Appearance + Flow | BLEU4 | 10.33 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 49.87 | | Unverified
2 | GIT | CIDEr | 32.43 | | Unverified
3 | SEM-POS | CIDEr | 26.01 | | Unverified
4 | AKGNN | CIDEr | 25.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 63.51 | | Unverified
2 | GIT | CIDEr | 45.63 | | Unverified
3 | SEM-POS | CIDEr | 37.16 | | Unverified
4 | AKGNN | CIDEr | 35.08 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SBD_Keyframe | BLEU4 | 41.01 | | Unverified
2 | V+S-Att-based | BLEU4 | 36.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 19.9 | | Unverified
2 | COSA | BLEU-4 | 18.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GVT | BLEU4 | 17.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VNS-GRU (Cross-Lingual) | BLEU-4 | 58.68 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shot2Story | CIDEr | 37.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 120.5 | | Unverified