Video Captioning

Video captioning is the task of automatically generating a caption for a video by understanding the actions and events it contains. The resulting text makes it possible to retrieve videos efficiently through text queries.

Source: NITS-VC System for VATEX Video Captioning Challenge 2020
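
To illustrate why captions help with text-based retrieval, here is a minimal sketch that ranks videos by word overlap between a query and their captions. The video ids, the captions, and the `search` helper are all hypothetical; a real system would likely use a captioning model's output together with a proper text index or embedding search.

```python
from collections import Counter

# Hypothetical captions, standing in for the output of a captioning model.
captions = {
    "vid_001": "a man is slicing an onion in a kitchen",
    "vid_002": "two dogs are playing with a ball in a park",
    "vid_003": "a woman is playing a guitar on a stage",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def search(query, captions, top_k=1):
    """Rank video ids by how many distinct query words appear in each caption."""
    query_words = set(tokenize(query))
    scores = Counter()
    for video_id, caption in captions.items():
        overlap = query_words & set(tokenize(caption))
        if overlap:
            scores[video_id] = len(overlap)
    return [video_id for video_id, _ in scores.most_common(top_k)]


print(search("dogs playing in a park", captions))  # ['vid_002']
```

Plain word overlap treats common words such as "a" and "in" like any other word, so a practical version would probably filter stop words or weight terms, for example with TF-IDF.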

Papers

Showing 51–100 of 473 papers

Title	Date	Tasks	Status	Hype	Score
Delving Deeper into the Decoder for Video Captioning	Jan 16, 2020	Decoder, Sentence	Code Available	1	5
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling	Jun 14, 2022	Decoder, Language Modeling	Code Available	1	5
Learning to Discretely Compose Reasoning Module Networks for Video Captioning	Jul 17, 2020	Decoder, Question Answering	Code Available	1	5
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures	Jul 27, 2023	Automatic Speech Recognition, Contrastive Learning	Code Available	1	5
An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling	Sep 4, 2022	Fill Mask, Optical Flow Estimation	Code Available	1	5
The MSR-Video to Text Dataset with Clean Annotations	Feb 12, 2021	Sentence, Video Captioning	Code Available	1	5
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark	Aug 5, 2024	Dense Video Captioning, Diversity	Code Available	1	5
Syntax-Aware Action Targeting for Video Captioning	Jun 1, 2020	Video Captioning	Code Available	1	5
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks	Nov 23, 2020	Action Classification, Action Localization	Code Available	1	5
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization	May 31, 2024	Sentence, Video Captioning	Code Available	1	5
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning	May 11, 2020	Sentence, Video Captioning	Code Available	1	5
Hierarchical Modular Network for Video Captioning	Nov 24, 2021	Representation Learning, Sentence	Code Available	1	5
Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos	Dec 16, 2023	Video Captioning, video narration captioning	Code Available	1	5
GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation	Mar 26, 2023	Video Captioning	Code Available	1	5
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o	Dec 18, 2024	Image Captioning, Video Captioning	Code Available	1	5
Hierarchical Video-Moment Retrieval and Step-Captioning	Mar 29, 2023	Information Retrieval, Moment Retrieval	Code Available	1	5
SoccerNet 2023 Challenges Results	Sep 12, 2023	Action Spotting, Camera Calibration	Code Available	1	5
Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches	Jun 30, 2022	Caption Generation, Video Captioning	Code Available	1	5
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval	Aug 15, 2023	Retrieval, Video Captioning	Code Available	1	5
RTQ: Rethinking Video-language Understanding Based on Image-text Model	Dec 1, 2023	Video Captioning, Video Question Answering	Code Available	1	5
Fine-grained Audible Video Description	Mar 27, 2023	Language Modeling, Language Modelling	Code Available	1	5
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning	Nov 1, 2020	Cross-Modal Retrieval, Representation Learning	Code Available	1	5
Action knowledge for video captioning with graph neural networks	Mar 16, 2023	Action Recognition, Graph Neural Network	Code Available	1	5
Large Scale Holistic Video Understanding	Apr 25, 2019	Action Classification, Action Recognition	Code Available	1	5
Semantic Grouping Network for Video Captioning	Feb 1, 2021	Video Captioning	Code Available	1	5
Controllable Video Captioning with an Exemplar Sentence	Dec 2, 2021	Caption Generation, Decoder	Code Available	1	5
Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis	Apr 12, 2024	Dense Video Captioning, Transfer Learning	Code Available	1	5
Learning Video Context as Interleaved Multimodal Sequences	Jul 31, 2024	Language Modeling, Language Modelling	Code Available	1	5
PaLI-X: On Scaling up a Multilingual Vision and Language Model	May 29, 2023	Chart Question Answering, document understanding	Code Available	1	5
Partially Relevant Video Retrieval	Aug 26, 2022	Moment Retrieval, Multiple Instance Learning	Code Available	1	5
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	Jun 15, 2023	Form, model	Code Available	1	5
Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks	Nov 14, 2021	Action Classification, Object	Code Available	1	5
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation	Aug 17, 2016	Caption Generation, Decoder	Code Available	1	5
Narrative Action Evaluation with Prompt-Guided Multimodal Interaction	Apr 22, 2024	Action Quality Assessment, multimodal interaction	Code Available	1	5
From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping	Apr 26, 2023	Decoder, Image Captioning	Code Available	1	5
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	Apr 1, 2021	Retrieval, Text Retrieval	Code Available	1	5
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos	May 2, 2020	Action Detection, Form	Code Available	1	5
GL-RG: Global-Local Representation Granularity for Video Captioning	May 22, 2022	Caption Generation, Descriptive	Code Available	1	5
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations	Nov 21, 2022	Contrastive Learning, Representation Learning	Code Available	1	5
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization	Jun 1, 2021	Question Answering, Retrieval	Code Available	1	5
Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language	Nov 18, 2020	Dictionary Learning, Disentanglement	Code Available	1	5
Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data	Jan 16, 2024	Image Generation, Text to Image Generation	Code Available	1	5
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training	May 1, 2020	Language Modeling, Language Modelling	Code Available	1	5
HiCM^2: Hierarchical Compact Memory Modeling for Dense Video Captioning	Dec 19, 2024	Dense Video Captioning, Video Captioning	Code Available	1	5
Comprehensive Information Integration Modeling Framework for Video Titling	Jun 24, 2020	Descriptive, Video Captioning	Code Available	1	5
AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding	Jun 19, 2024	Question Answering, Spatial Reasoning	Code Available	1	5
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning	Sep 26, 2024	Image Captioning, Retrieval	Code Available	1	5
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners	May 22, 2022	Attribute, Automatic Speech Recognition	Code Available	1	5
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation	Aug 8, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Code Available	1	5
Poet: Product-oriented Video Captioner for E-commerce	Aug 16, 2020	Video Captioning	Code Available	1	5


Datasets: MSR-VTT, MSVD, YouCook2, VATEX, ActivityNet Captions, MSRVTT-CTN, MSVD-CTN, Hindi MSR-VTT, TVC, ChinaOpen-1k, MSVD-Indonesian, Shot2Story20K

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	mPLUG-2	CIDEr	80	—	Unverified
2	VAST	CIDEr	78	—	Unverified
3	GIT2	CIDEr	75.9	—	Unverified
4	VLAB	CIDEr	74.9	—	Unverified
5	COSA	CIDEr	74.7	—	Unverified
6	VALOR	CIDEr	74	—	Unverified
7	MaMMUT	CIDEr	73.6	—	Unverified
8	VideoCoCa	CIDEr	73.2	—	Unverified
9	RTQ	CIDEr	69.3	—	Unverified
10	HowToCaption	CIDEr	65.3	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	MaMMUT	CIDEr	195.6	—	Unverified
2	VLAB	CIDEr	179.8	—	Unverified
3	COSA	CIDEr	178.5	—	Unverified
4	VALOR	CIDEr	178.5	—	Unverified
5	mPLUG-2	CIDEr	165.8	—	Unverified
6	HowToCaption	CIDEr	154.2	—	Unverified
7	HiTeA	CIDEr	146.9	—	Unverified
8	Vid2Seq	CIDEr	146.2	—	Unverified
9	VIOLETv2	CIDEr	139.2	—	Unverified
10	RTQ	CIDEr	123.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VAST	BLEU-4	18.2	—	Unverified
2	UniVL + MELTR	BLEU-4	17.92	—	Unverified
3	UniVL	BLEU-4	17.35	—	Unverified
4	VideoCoCa	BLEU-4	14.2	—	Unverified
5	VLM	BLEU-4	12.27	—	Unverified
6	E2vidD6-MASSvid-BiD	BLEU-4	12.04	—	Unverified
7	TextKG	BLEU-4	11.7	—	Unverified
8	COOT	BLEU-4	11.3	—	Unverified
9	COSA	BLEU-4	10.1	—	Unverified
10	HowToCaption	BLEU-4	8.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	BLEU-4	45.6	—	Unverified
2	VAST	BLEU-4	45	—	Unverified
3	COSA	BLEU-4	43.7	—	Unverified
4	VideoCoCa	BLEU-4	39.7	—	Unverified
5	IcoCap (ViT-B/16)	BLEU-4	37.4	—	Unverified
6	IcoCap (ViT-B/32)	BLEU-4	36.9	—	Unverified
7	VASTA (Kinetics-backbone)	BLEU-4	36.25	—	Unverified
8	CoCap (ViT/L14)	BLEU-4	35.8	—	Unverified
9	ORG-TRL	BLEU-4	32.1	—	Unverified
10	NITS-VC	BLEU-4	20	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VideoCoCa	BLEU-4	14.7	—	Unverified
2	VLTinT (ae-test split) C3D/Ling	BLEU-4	14.5	—	Unverified
3	VLCap (ae-test split) - Appearance + Language	BLEU-4	13.38	—	Unverified
4	COOT (ae-test split) - Only Appearance features	BLEU-4	10.85	—	Unverified
5	MART (ae-test split) - Appearance + Flow	BLEU-4	10.33	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	CEN	CIDEr	49.87	—	Unverified
2	GIT	CIDEr	32.43	—	Unverified
3	SEM-POS	CIDEr	26.01	—	Unverified
4	AKGNN	CIDEr	25.9	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	CEN	CIDEr	63.51	—	Unverified
2	GIT	CIDEr	45.63	—	Unverified
3	SEM-POS	CIDEr	37.16	—	Unverified
4	AKGNN	CIDEr	35.08	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	SBD_Keyframe	BLEU-4	41.01	—	Unverified
2	V+S-Att-based	BLEU-4	36.2	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VAST	BLEU-4	19.9	—	Unverified
2	COSA	BLEU-4	18.8	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GVT	BLEU-4	17.7	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VNS-GRU (Cross-Lingual)	BLEU-4	58.68	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Shot2Story	CIDEr	37.4	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	Vid2Seq	CIDEr	120.5	—	Unverified