SOTAVerified

Video Captioning

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events in it, which can help retrieve the video efficiently through text.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Papers

Showing 351-400 of 473 papers

Title | Status | Hype
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation | | 0
Sparse Graph to Sequence Learning for Vision Conditioned Long Textual Sequence Generation | | 0
Spatio-Temporal Attention Models for Grounded Video Captioning | | 0
Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning | | 0
Spatio-Temporal Graph for Video Captioning with Knowledge Distillation | | 0
Spatio-Temporal Ranked-Attention Networks for Video Captioning | | 0
SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities | | 0
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training | | 0
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | | 0
Story Generation from Visual Inputs: Techniques, Related Tasks, and Challenges | | 0
Storytelling of Photo Stream with Bidirectional Multi-thread Recurrent Neural Network | | 0
Streaming Dense Video Captioning | | 0
Watch It Twice: Video Captioning with a Refocused Video Encoder | | 0
Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos | | 0
SOVC: Subject-Oriented Video Captioning | | 0
Supervising Neural Attention Models for Video Captioning by Human Gaze Data | | 0
Active Learning for Video Description With Cluster-Regularized Ensemble Ranking | | 0
CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising | | 0
CLIP4Caption: CLIP for Video Caption | | 0
Weakly Supervised Dense Video Captioning | | 0
Classifier-Guided Captioning Across Modalities | | 0
Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description | | 0
TCR: Short Video Title Generation and Cover Selection with Attention Refinement | | 0
Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning | | 0
Technical Report for Soccernet 2023 -- Dense Video Captioning | | 0
Chinese Whispers: Cooperative Paraphrase Acquisition | | 0
Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching | | 0
Temporally Grounding Natural Sentence in Video | | 0
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks | | 0
Temporal Perceiving Video-Language Pre-training | | 0
A Dataset for Telling the Stories of Social Media Videos | | 0
Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models | | 0
Text with Knowledge Graph Augmented Transformer for Video Captioning | | 0
The 8th AI City Challenge | | 0
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning | | 0
The Use of Video Captioning for Fostering Physical Activity | | 0
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning | | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | | 0
Title Generation for User Generated Videos | | 0
What's in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning | | 0
Adaptive Feature Abstraction for Translating Video to Text | | 0
Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning | | 0
Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset | | 0
Prediction and Description of Near-Future Activities in Video | | 0
Wolf: Captioning Everything with a World Summarization Framework | | 0
Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications | | 0
Translating Videos to Natural Language Using Deep Recurrent Neural Networks | | 0
TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval | | 0
FIOVA: A Multi-Annotator Benchmark for Human-Aligned Video Captioning | | 0
Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | mPLUG-2 | CIDEr | 80 | | Unverified
2 | VAST | CIDEr | 78 | | Unverified
3 | GIT2 | CIDEr | 75.9 | | Unverified
4 | VLAB | CIDEr | 74.9 | | Unverified
5 | COSA | CIDEr | 74.7 | | Unverified
6 | VALOR | CIDEr | 74 | | Unverified
7 | MaMMUT (ours) | CIDEr | 73.6 | | Unverified
8 | VideoCoCa | CIDEr | 73.2 | | Unverified
9 | RTQ | CIDEr | 69.3 | | Unverified
10 | HowToCaption | CIDEr | 65.3 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MaMMUT | CIDEr | 195.6 | | Unverified
2 | VLAB | CIDEr | 179.8 | | Unverified
3 | COSA | CIDEr | 178.5 | | Unverified
4 | VALOR | CIDEr | 178.5 | | Unverified
5 | mPLUG-2 | CIDEr | 165.8 | | Unverified
6 | HowToCaption | CIDEr | 154.2 | | Unverified
7 | HiTeA | CIDEr | 146.9 | | Unverified
8 | Vid2Seq | CIDEr | 146.2 | | Unverified
9 | VIOLETv2 | CIDEr | 139.2 | | Unverified
10 | RTQ | CIDEr | 123.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 18.2 | | Unverified
2 | UniVL + MELTR | BLEU-4 | 17.92 | | Unverified
3 | UniVL | BLEU-4 | 17.35 | | Unverified
4 | VideoCoCa | BLEU-4 | 14.2 | | Unverified
5 | VLM | BLEU-4 | 12.27 | | Unverified
6 | E2vidD6-MASSvid-BiD | BLEU-4 | 12.04 | | Unverified
7 | TextKG | BLEU-4 | 11.7 | | Unverified
8 | COOT | BLEU-4 | 11.3 | | Unverified
9 | COSA | BLEU-4 | 10.1 | | Unverified
10 | HowToCaption | BLEU-4 | 8.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VALOR | BLEU-4 | 45.6 | | Unverified
2 | VAST | BLEU-4 | 45 | | Unverified
3 | COSA | BLEU-4 | 43.7 | | Unverified
4 | VideoCoCa | BLEU-4 | 39.7 | | Unverified
5 | IcoCap (ViT-B/16) | BLEU-4 | 37.4 | | Unverified
6 | IcoCap (ViT-B/32) | BLEU-4 | 36.9 | | Unverified
7 | VASTA (Kinetics-backbone) | BLEU-4 | 36.25 | | Unverified
8 | CoCap (ViT/L14) | BLEU-4 | 35.8 | | Unverified
9 | ORG-TRL | BLEU-4 | 32.1 | | Unverified
10 | NITS-VC | BLEU-4 | 20 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VideoCoCa | BLEU4 | 14.7 | | Unverified
2 | VLTinT (ae-test split) C3D/Ling | BLEU4 | 14.5 | | Unverified
3 | VLCap (ae-test split) - Appearance + Language | BLEU4 | 13.38 | | Unverified
4 | COOT (ae-test split) - Only Appearance features | BLEU4 | 10.85 | | Unverified
5 | MART (ae-test split) - Appearance + Flow | BLEU4 | 10.33 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 49.87 | | Unverified
2 | GIT | CIDEr | 32.43 | | Unverified
3 | SEM-POS | CIDEr | 26.01 | | Unverified
4 | AKGNN | CIDEr | 25.9 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 63.51 | | Unverified
2 | GIT | CIDEr | 45.63 | | Unverified
3 | SEM-POS | CIDEr | 37.16 | | Unverified
4 | AKGNN | CIDEr | 35.08 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SBD_Keyframe | BLEU4 | 41.01 | | Unverified
2 | V+S-Att-based | BLEU4 | 36.2 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 19.9 | | Unverified
2 | COSA | BLEU-4 | 18.8 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GVT | BLEU4 | 17.7 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VNS-GRU (Cross-Lingual) | BLEU-4 | 58.68 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shot2Story | CIDEr | 37.4 | | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 120.5 | | Unverified