SOTAVerified

Video Captioning

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains. The resulting text makes the video easier to retrieve efficiently.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Papers

Showing 101-150 of 473 papers

Title | Status | Hype
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1
A Comprehensive Review of the Video-to-Text Problem | Code | 1
The MSR-Video to Text Dataset with Clean Annotations | Code | 1
Semantic Grouping Network for Video Captioning | Code | 1
A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules | Code | 1
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks | Code | 1
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language | Code | 1
Multimodal Pretraining for Dense Video Captioning | Code | 1
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | Code | 1
Improved Actor Relation Graph based Group Activity Recognition | Code | 1
Poet: Product-oriented Video Captioner for E-commerce | Code | 1
SODA: Story Oriented Dense Video Captioning Evaluation Framework | Code | 1
Learning to Generate Grounded Visual Captions without Localization Supervision | Code | 1
Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Code | 1
Comprehensive Information Integration Modeling Framework for Video Titling | Code | 1
Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020 | Code | 1
Video Moment Localization using Object Evidence and Reverse Captioning | Code | 1
Syntax-Aware Action Targeting for Video Captioning | Code | 1
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer | Code | 1
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning | Code | 1
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos | Code | 1
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Code | 1
Multi-modal Dense Video Captioning | Code | 1
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning | Code | 1
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation | Code | 1
Delving Deeper into the Decoder for Video Captioning | Code | 1
Learning to Generate Grounded Visual Captions without Localization Supervision | Code | 1
Large Scale Holistic Video Understanding | Code | 1
What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment | Code | 1
VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | Code | 1
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation | Code | 1
Video captioning with recurrent networks based on frame- and video-level features and visual content classification | Code | 1
Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization | | 0
Dense Video Captioning using Graph-based Sentence Summarization | | 0
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | | 0
ARGUS: Hallucination and Omission Evaluation in Video-LLMs | | 0
Temporal Object Captioning for Street Scene Videos from LiDAR Tracks | | 0
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks | Code | 0
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation | | 0
Describe Anything: Detailed Localized Image and Video Captioning | | 0
FocusedAD: Character-centric Movie Audio Description | Code | 0
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning | | 0
Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding | | 0
Get In Video: Add Anything You Want to the Video | | 0
Fine-Grained Video Captioning through Scene Graph Consolidation | | 0
LongCaptioning: Unlocking the Power of Long Caption Generation in Large Multimodal Models | | 0
Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning | | 0
Pretrained Image-Text Models are Secretly Video Captioners | Code | 0
MAMS: Model-Agnostic Module Selection Framework for Video Captioning | | 0
Classifier-Guided Captioning Across Modalities | | 0

Benchmark Results

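# | Model | Metric | Claimed | Verified | Status
1 | mPLUG-2 | CIDEr | 80 | | Unverified
2 | VAST | CIDEr | 78 | | Unverified
3 | GIT2 | CIDEr | 75.9 | | Unverified
4 | VLAB | CIDEr | 74.9 | | Unverified
5 | COSA | CIDEr | 74.7 | | Unverified
6 | VALOR | CIDEr | 74 | | Unverified
7 | MaMMUT (ours) | CIDEr | 73.6 | | Unverified
8 | VideoCoCa | CIDEr | 73.2 | | Unverified
9 | RTQ | CIDEr | 69.3 | | Unverified
10 | HowToCaption | CIDEr | 65.3 | | Unverified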
# | Model | Metric | Claimed | Verified | Status
1 | MaMMUT | CIDEr | 195.6 | | Unverified
2 | VLAB | CIDEr | 179.8 | | Unverified
3 | COSA | CIDEr | 178.5 | | Unverified
4 | VALOR | CIDEr | 178.5 | | Unverified
5 | mPLUG-2 | CIDEr | 165.8 | | Unverified
6 | HowToCaption | CIDEr | 154.2 | | Unverified
7 | HiTeA | CIDEr | 146.9 | | Unverified
8 | Vid2Seq | CIDEr | 146.2 | | Unverified
9 | VIOLETv2 | CIDEr | 139.2 | | Unverified
10 | RTQ | CIDEr | 123.4 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 18.2 | | Unverified
2 | UniVL + MELTR | BLEU-4 | 17.92 | | Unverified
3 | UniVL | BLEU-4 | 17.35 | | Unverified
4 | VideoCoCa | BLEU-4 | 14.2 | | Unverified
5 | VLM | BLEU-4 | 12.27 | | Unverified
6 | E2vidD6-MASSvid-BiD | BLEU-4 | 12.04 | | Unverified
7 | TextKG | BLEU-4 | 11.7 | | Unverified
8 | COOT | BLEU-4 | 11.3 | | Unverified
9 | COSA | BLEU-4 | 10.1 | | Unverified
10 | HowToCaption | BLEU-4 | 8.8 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VALOR | BLEU-4 | 45.6 | | Unverified
2 | VAST | BLEU-4 | 45 | | Unverified
3 | COSA | BLEU-4 | 43.7 | | Unverified
4 | VideoCoCa | BLEU-4 | 39.7 | | Unverified
5 | IcoCap (ViT-B/16) | BLEU-4 | 37.4 | | Unverified
6 | IcoCap (ViT-B/32) | BLEU-4 | 36.9 | | Unverified
7 | VASTA (Kinetics-backbone) | BLEU-4 | 36.25 | | Unverified
8 | CoCap (ViT/L14) | BLEU-4 | 35.8 | | Unverified
9 | ORG-TRL | BLEU-4 | 32.1 | | Unverified
10 | NITS-VC | BLEU-4 | 20 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VideoCoCa | BLEU4 | 14.7 | | Unverified
2 | VLTinT (ae-test split) C3D/Ling | BLEU4 | 14.5 | | Unverified
3 | VLCap (ae-test split) - Appearance + Language | BLEU4 | 13.38 | | Unverified
4 | COOT (ae-test split) - Only Appearance features | BLEU4 | 10.85 | | Unverified
5 | MART (ae-test split) - Appearance + Flow | BLEU4 | 10.33 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 49.87 | | Unverified
2 | GIT | CIDEr | 32.43 | | Unverified
3 | SEM-POS | CIDEr | 26.01 | | Unverified
4 | AKGNN | CIDEr | 25.9 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 63.51 | | Unverified
2 | GIT | CIDEr | 45.63 | | Unverified
3 | SEM-POS | CIDEr | 37.16 | | Unverified
4 | AKGNN | CIDEr | 35.08 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | SBD_Keyframe | BLEU4 | 41.01 | | Unverified
2 | V+S-Att-based | BLEU4 | 36.2 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 19.9 | | Unverified
2 | COSA | BLEU-4 | 18.8 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GVT | BLEU4 | 17.7 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VNS-GRU (Cross-Lingual) | BLEU-4 | 58.68 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Shot2Story | CIDEr | 37.4 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 120.5 | | Unverified