SOTAVerified

Video Captioning

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which in turn enables efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Papers

Showing 51-100 of 473 papers

Title | Status | Hype
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | Code | 1
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures | Code | 1
Delving Deeper into the Decoder for Video Captioning | Code | 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
The MSR-Video to Text Dataset with Clean Annotations | Code | 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Code | 1
Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Code | 1
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Code | 1
A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules | Code | 1
HiCM^2: Hierarchical Compact Memory Modeling for Dense Video Captioning | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation | Code | 1
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos | Code | 1
Semantic Grouping Network for Video Captioning | Code | 1
SoccerNet 2023 Challenges Results | Code | 1
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval | Code | 1
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation | Code | 1
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches | Code | 1
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code | 1
Action knowledge for video captioning with graph neural networks | Code | 1
Fine-grained Audible Video Description | Code | 1
Hierarchical Modular Network for Video Captioning | Code | 1
RTQ: Rethinking Video-language Understanding Based on Image-text Model | Code | 1
Controllable Video Captioning with an Exemplar Sentence | Code | 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Code | 1
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | Code | 1
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | Code | 1
PaLI-X: On Scaling up a Multilingual Vision and Language Model | Code | 1
Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks | Code | 1
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1
Poet: Product-oriented Video Captioner for E-commerce | Code | 1
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation | Code | 1
From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping | Code | 1
Multi-modal Dense Video Captioning | Code | 1
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Code | 1
GL-RG: Global-Local Representation Granularity for Video Captioning | Code | 1
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1
Multimodal Pretraining for Dense Video Captioning | Code | 1
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data | Code | 1
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Code | 1
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Code | 1
Comprehensive Information Integration Modeling Framework for Video Titling | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale | Code | 1
Large Scale Holistic Video Understanding | Code | 1
A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer | Code | 1
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | mPLUG-2 | CIDEr | 80 |  | Unverified
2 | VAST | CIDEr | 78 |  | Unverified
3 | GIT2 | CIDEr | 75.9 |  | Unverified
4 | VLAB | CIDEr | 74.9 |  | Unverified
5 | COSA | CIDEr | 74.7 |  | Unverified
6 | VALOR | CIDEr | 74 |  | Unverified
7 | MaMMUT (ours) | CIDEr | 73.6 |  | Unverified
8 | VideoCoCa | CIDEr | 73.2 |  | Unverified
9 | RTQ | CIDEr | 69.3 |  | Unverified
10 | HowToCaption | CIDEr | 65.3 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MaMMUT | CIDEr | 195.6 |  | Unverified
2 | VLAB | CIDEr | 179.8 |  | Unverified
3 | COSA | CIDEr | 178.5 |  | Unverified
4 | VALOR | CIDEr | 178.5 |  | Unverified
5 | mPLUG-2 | CIDEr | 165.8 |  | Unverified
6 | HowToCaption | CIDEr | 154.2 |  | Unverified
7 | HiTeA | CIDEr | 146.9 |  | Unverified
8 | Vid2Seq | CIDEr | 146.2 |  | Unverified
9 | VIOLETv2 | CIDEr | 139.2 |  | Unverified
10 | RTQ | CIDEr | 123.4 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 18.2 |  | Unverified
2 | UniVL + MELTR | BLEU-4 | 17.92 |  | Unverified
3 | UniVL | BLEU-4 | 17.35 |  | Unverified
4 | VideoCoCa | BLEU-4 | 14.2 |  | Unverified
5 | VLM | BLEU-4 | 12.27 |  | Unverified
6 | E2vidD6-MASSvid-BiD | BLEU-4 | 12.04 |  | Unverified
7 | TextKG | BLEU-4 | 11.7 |  | Unverified
8 | COOT | BLEU-4 | 11.3 |  | Unverified
9 | COSA | BLEU-4 | 10.1 |  | Unverified
10 | HowToCaption | BLEU-4 | 8.8 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VALOR | BLEU-4 | 45.6 |  | Unverified
2 | VAST | BLEU-4 | 45 |  | Unverified
3 | COSA | BLEU-4 | 43.7 |  | Unverified
4 | VideoCoCa | BLEU-4 | 39.7 |  | Unverified
5 | IcoCap (ViT-B/16) | BLEU-4 | 37.4 |  | Unverified
6 | IcoCap (ViT-B/32) | BLEU-4 | 36.9 |  | Unverified
7 | VASTA (Kinetics-backbone) | BLEU-4 | 36.25 |  | Unverified
8 | CoCap (ViT/L14) | BLEU-4 | 35.8 |  | Unverified
9 | ORG-TRL | BLEU-4 | 32.1 |  | Unverified
10 | NITS-VC | BLEU-4 | 20 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VideoCoCa | BLEU4 | 14.7 |  | Unverified
2 | VLTinT (ae-test split) C3D/Ling | BLEU4 | 14.5 |  | Unverified
3 | VLCap (ae-test split) - Appearance + Language | BLEU4 | 13.38 |  | Unverified
4 | COOT (ae-test split) - Only Appearance features | BLEU4 | 10.85 |  | Unverified
5 | MART (ae-test split) - Appearance + Flow | BLEU4 | 10.33 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 49.87 |  | Unverified
2 | GIT | CIDEr | 32.43 |  | Unverified
3 | SEM-POS | CIDEr | 26.01 |  | Unverified
4 | AKGNN | CIDEr | 25.9 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 63.51 |  | Unverified
2 | GIT | CIDEr | 45.63 |  | Unverified
3 | SEM-POS | CIDEr | 37.16 |  | Unverified
4 | AKGNN | CIDEr | 35.08 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SBD_Keyframe | BLEU4 | 41.01 |  | Unverified
2 | V+S-Att-based | BLEU4 | 36.2 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 19.9 |  | Unverified
2 | COSA | BLEU-4 | 18.8 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GVT | BLEU4 | 17.7 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VNS-GRU (Cross-Lingual) | BLEU-4 | 58.68 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shot2Story | CIDEr | 37.4 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 120.5 |  | Unverified