SOTAVerified

Video Captioning

Video Captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which in turn makes the video easier to retrieve efficiently through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Papers

Showing 151-200 of 473 papers

Title | Status | Hype
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | - | 0
Event-Equalized Dense Video Captioning | - | 0
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | - | 0
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | - | 0
PolySmart @ TRECVid 2024 Video Captioning (VTT) | - | 0
Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning | Code | 0
VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting | Code | 0
Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning | - | 0
Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives | - | 0
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation | - | 0
Agent-based Video Trimming | - | 0
Video LLMs for Temporal Reasoning in Long Videos | - | 0
Progress-Aware Video Frame Captioning | - | 0
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation | - | 0
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding | - | 0
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity | - | 0
Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning | - | 0
AdaCM^2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction | - | 0
Multi-Modal interpretable automatic video captioning | - | 0
Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning | Code | 0
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities | - | 0
Technical Report for Soccernet 2023 -- Dense Video Captioning | - | 0
EVC-MF: End-to-end Video Captioning Network with Multi-scale Features | - | 0
FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | - | 0
It's Just Another Day: Unique Video Captioning by Discriminative Prompting | - | 0
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models | - | 0
Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | - | 0
SoccerNet 2024 Challenges Results | Code | 0
Fine-grained length controllable video captioning with ordinal embeddings | - | 0
LongVILA: Scaling Long-Context Visual Language Models for Long Videos | - | 0
Dual-path Collaborative Generation Network for Emotional Video Captioning | Code | 0
Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos | Code | 0
Wolf: Captioning Everything with a World Summarization Framework | - | 0
Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance | - | 0
EVLM: An Efficient Vision-Language Model for Visual Understanding | - | 0
https://arxiv.org/abs/2407.00634 | Code | 0
Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks | - | 0
Live Video Captioning | Code | 0
GUI Action Narrator: Where and When Did That Action Take Place? | - | 0
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | - | 0
A Survey of Video Datasets for Grounded Event Understanding | Code | 0
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative | - | 0
Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | - | 0
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning | - | 0
A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection) | - | 0
The 8th AI City Challenge | - | 0
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement | - | 0
Streaming Dense Video Captioning | - | 0
Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding | - | 0
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | mPLUG-2 | CIDEr | 80 | - | Unverified
2 | VAST | CIDEr | 78 | - | Unverified
3 | GIT2 | CIDEr | 75.9 | - | Unverified
4 | VLAB | CIDEr | 74.9 | - | Unverified
5 | COSA | CIDEr | 74.7 | - | Unverified
6 | VALOR | CIDEr | 74 | - | Unverified
7 | MaMMUT (ours) | CIDEr | 73.6 | - | Unverified
8 | VideoCoCa | CIDEr | 73.2 | - | Unverified
9 | RTQ | CIDEr | 69.3 | - | Unverified
10 | HowToCaption | CIDEr | 65.3 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MaMMUT | CIDEr | 195.6 | - | Unverified
2 | VLAB | CIDEr | 179.8 | - | Unverified
3 | COSA | CIDEr | 178.5 | - | Unverified
4 | VALOR | CIDEr | 178.5 | - | Unverified
5 | mPLUG-2 | CIDEr | 165.8 | - | Unverified
6 | HowToCaption | CIDEr | 154.2 | - | Unverified
7 | HiTeA | CIDEr | 146.9 | - | Unverified
8 | Vid2Seq | CIDEr | 146.2 | - | Unverified
9 | VIOLETv2 | CIDEr | 139.2 | - | Unverified
10 | RTQ | CIDEr | 123.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 18.2 | - | Unverified
2 | UniVL + MELTR | BLEU-4 | 17.92 | - | Unverified
3 | UniVL | BLEU-4 | 17.35 | - | Unverified
4 | VideoCoCa | BLEU-4 | 14.2 | - | Unverified
5 | VLM | BLEU-4 | 12.27 | - | Unverified
6 | E2vidD6-MASSvid-BiD | BLEU-4 | 12.04 | - | Unverified
7 | TextKG | BLEU-4 | 11.7 | - | Unverified
8 | COOT | BLEU-4 | 11.3 | - | Unverified
9 | COSA | BLEU-4 | 10.1 | - | Unverified
10 | HowToCaption | BLEU-4 | 8.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VALOR | BLEU-4 | 45.6 | - | Unverified
2 | VAST | BLEU-4 | 45 | - | Unverified
3 | COSA | BLEU-4 | 43.7 | - | Unverified
4 | VideoCoCa | BLEU-4 | 39.7 | - | Unverified
5 | IcoCap (ViT-B/16) | BLEU-4 | 37.4 | - | Unverified
6 | IcoCap (ViT-B/32) | BLEU-4 | 36.9 | - | Unverified
7 | VASTA (Kinetics-backbone) | BLEU-4 | 36.25 | - | Unverified
8 | CoCap (ViT/L14) | BLEU-4 | 35.8 | - | Unverified
9 | ORG-TRL | BLEU-4 | 32.1 | - | Unverified
10 | NITS-VC | BLEU-4 | 20 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VideoCoCa | BLEU4 | 14.7 | - | Unverified
2 | VLTinT (ae-test split) C3D/Ling | BLEU4 | 14.5 | - | Unverified
3 | VLCap (ae-test split) - Appearance + Language | BLEU4 | 13.38 | - | Unverified
4 | COOT (ae-test split) - Only Appearance features | BLEU4 | 10.85 | - | Unverified
5 | MART (ae-test split) - Appearance + Flow | BLEU4 | 10.33 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 49.87 | - | Unverified
2 | GIT | CIDEr | 32.43 | - | Unverified
3 | SEM-POS | CIDEr | 26.01 | - | Unverified
4 | AKGNN | CIDEr | 25.9 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 63.51 | - | Unverified
2 | GIT | CIDEr | 45.63 | - | Unverified
3 | SEM-POS | CIDEr | 37.16 | - | Unverified
4 | AKGNN | CIDEr | 35.08 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SBD_Keyframe | BLEU4 | 41.01 | - | Unverified
2 | V+S-Att-based | BLEU4 | 36.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 19.9 | - | Unverified
2 | COSA | BLEU-4 | 18.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GVT | BLEU4 | 17.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VNS-GRU (Cross-Lingual) | BLEU-4 | 58.68 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shot2Story | CIDEr | 37.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 120.5 | - | Unverified