SOTAVerified

Video Captioning

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains, which in turn supports efficient text-based retrieval of the video.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020

Papers

Showing 51-100 of 473 papers

Title | Status | Hype
Delving Deeper into the Decoder for Video Captioning | Code | 1
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling | Code | 1
Learning to Discretely Compose Reasoning Module Networks for Video Captioning | Code | 1
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures | Code | 1
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | Code | 1
The MSR-Video to Text Dataset with Clean Annotations | Code | 1
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark | Code | 1
Syntax-Aware Action Targeting for Video Captioning | Code | 1
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks | Code | 1
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization | Code | 1
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning | Code | 1
Hierarchical Modular Network for Video Captioning | Code | 1
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos | Code | 1
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation | Code | 1
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Code | 1
Hierarchical Video-Moment Retrieval and Step-Captioning | Code | 1
SoccerNet 2023 Challenges Results | Code | 1
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches | Code | 1
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval | Code | 1
RTQ: Rethinking Video-language Understanding Based on Image-text Model | Code | 1
Fine-grained Audible Video Description | Code | 1
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning | Code | 1
Action knowledge for video captioning with graph neural networks | Code | 1
Large Scale Holistic Video Understanding | Code | 1
Semantic Grouping Network for Video Captioning | Code | 1
Controllable Video Captioning with an Exemplar Sentence | Code | 1
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis | Code | 1
Learning Video Context as Interleaved Multimodal Sequences | Code | 1
PaLI-X: On Scaling up a Multilingual Vision and Language Model | Code | 1
Partially Relevant Video Retrieval | Code | 1
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | Code | 1
Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks | Code | 1
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation | Code | 1
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction | Code | 1
From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping | Code | 1
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval | Code | 1
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos | Code | 1
GL-RG: Global-Local Representation Granularity for Video Captioning | Code | 1
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | Code | 1
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization | Code | 1
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language | Code | 1
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data | Code | 1
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | Code | 1
HiCM^2: Hierarchical Compact Memory Modeling for Dense Video Captioning | Code | 1
Comprehensive Information Integration Modeling Framework for Video Titling | Code | 1
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Code | 1
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning | Code | 1
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners | Code | 1
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation | Code | 1
Poet: Product-oriented Video Captioner for E-commerce | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | mPLUG-2 | CIDEr | 80 | - | Unverified
2 | VAST | CIDEr | 78 | - | Unverified
3 | GIT2 | CIDEr | 75.9 | - | Unverified
4 | VLAB | CIDEr | 74.9 | - | Unverified
5 | COSA | CIDEr | 74.7 | - | Unverified
6 | VALOR | CIDEr | 74 | - | Unverified
7 | MaMMUT (ours) | CIDEr | 73.6 | - | Unverified
8 | VideoCoCa | CIDEr | 73.2 | - | Unverified
9 | RTQ | CIDEr | 69.3 | - | Unverified
10 | HowToCaption | CIDEr | 65.3 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | MaMMUT | CIDEr | 195.6 | - | Unverified
2 | VLAB | CIDEr | 179.8 | - | Unverified
3 | COSA | CIDEr | 178.5 | - | Unverified
4 | VALOR | CIDEr | 178.5 | - | Unverified
5 | mPLUG-2 | CIDEr | 165.8 | - | Unverified
6 | HowToCaption | CIDEr | 154.2 | - | Unverified
7 | HiTeA | CIDEr | 146.9 | - | Unverified
8 | Vid2Seq | CIDEr | 146.2 | - | Unverified
9 | VIOLETv2 | CIDEr | 139.2 | - | Unverified
10 | RTQ | CIDEr | 123.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 18.2 | - | Unverified
2 | UniVL + MELTR | BLEU-4 | 17.92 | - | Unverified
3 | UniVL | BLEU-4 | 17.35 | - | Unverified
4 | VideoCoCa | BLEU-4 | 14.2 | - | Unverified
5 | VLM | BLEU-4 | 12.27 | - | Unverified
6 | E2vidD6-MASSvid-BiD | BLEU-4 | 12.04 | - | Unverified
7 | TextKG | BLEU-4 | 11.7 | - | Unverified
8 | COOT | BLEU-4 | 11.3 | - | Unverified
9 | COSA | BLEU-4 | 10.1 | - | Unverified
10 | HowToCaption | BLEU-4 | 8.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VALOR | BLEU-4 | 45.6 | - | Unverified
2 | VAST | BLEU-4 | 45 | - | Unverified
3 | COSA | BLEU-4 | 43.7 | - | Unverified
4 | VideoCoCa | BLEU-4 | 39.7 | - | Unverified
5 | IcoCap (ViT-B/16) | BLEU-4 | 37.4 | - | Unverified
6 | IcoCap (ViT-B/32) | BLEU-4 | 36.9 | - | Unverified
7 | VASTA (Kinetics-backbone) | BLEU-4 | 36.25 | - | Unverified
8 | CoCap (ViT/L14) | BLEU-4 | 35.8 | - | Unverified
9 | ORG-TRL | BLEU-4 | 32.1 | - | Unverified
10 | NITS-VC | BLEU-4 | 20 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VideoCoCa | BLEU4 | 14.7 | - | Unverified
2 | VLTinT (ae-test split) C3D/Ling | BLEU4 | 14.5 | - | Unverified
3 | VLCap (ae-test split) - Appearance + Language | BLEU4 | 13.38 | - | Unverified
4 | COOT (ae-test split) - Only Appearance features | BLEU4 | 10.85 | - | Unverified
5 | MART (ae-test split) - Appearance + Flow | BLEU4 | 10.33 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 49.87 | - | Unverified
2 | GIT | CIDEr | 32.43 | - | Unverified
3 | SEM-POS | CIDEr | 26.01 | - | Unverified
4 | AKGNN | CIDEr | 25.9 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | CEN | CIDEr | 63.51 | - | Unverified
2 | GIT | CIDEr | 45.63 | - | Unverified
3 | SEM-POS | CIDEr | 37.16 | - | Unverified
4 | AKGNN | CIDEr | 35.08 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | SBD_Keyframe | BLEU4 | 41.01 | - | Unverified
2 | V+S-Att-based | BLEU4 | 36.2 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VAST | BLEU-4 | 19.9 | - | Unverified
2 | COSA | BLEU-4 | 18.8 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | GVT | BLEU4 | 17.7 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | VNS-GRU (Cross-Lingual) | BLEU-4 | 58.68 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Shot2Story | CIDEr | 37.4 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Vid2Seq | CIDEr | 120.5 | - | Unverified