Image Captioning

Image captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
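To make the BLEU metric mentioned above more concrete, here is a minimal sketch of sentence-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. The function names (`ngrams`, `modified_precision`, `sentence_bleu`) are illustrative and do not come from any benchmark toolkit. The sketch assumes pre-tokenized input and applies no smoothing; leaderboard numbers are normally corpus-level scores from an official evaluation toolkit, usually reported multiplied by 100, so this is not a replacement for those tools.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    # Reference length closest to the candidate length (ties go to the shorter).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    cand_len = len(candidate)
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    reference = "a dog runs on the green grass".split()
    candidate = "a dog runs on the grass".split()
    print(sentence_bleu(candidate, [reference]))
```

CIDEr follows a similar n-gram matching idea but weights n-grams by TF-IDF over the reference corpus, which is why its scores are on a different scale from BLEU.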

Papers

Showing 1351–1400 of 1878 papers

Title	Date	Tasks	Status
Variational Distribution Learning for Unsupervised Text-to-Image Generation	Mar 28, 2023	Image Captioning, Image Generation	Unverified
Variational Structured Semantic Inference for Diverse Image Captioning	Dec 1, 2019	Decoder, Diversity	Unverified
A Frustratingly Simple Approach for End-to-End Image Captioning	Jan 30, 2022	Decoder, Image Captioning	Unverified
VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall	Jan 15, 2025	Image Captioning	Unverified
Vector Learning for Cross Domain Representations	Sep 27, 2018	Decoder, Image Captioning	Unverified
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models	Oct 9, 2023	Image Captioning, Visual Commonsense Reasoning	Unverified
Video Event Detection by Exploiting Word Dependencies from Image Captions	Dec 1, 2016	Action Detection, Event Detection	Unverified
VideoGameBunny: Towards vision assistants for video games	Jul 21, 2024	Image Captioning, Scene Understanding	Unverified
VieCap4H-VLSP 2021: ObjectAoA-Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning	Nov 10, 2022	Image Captioning, Vietnamese Image Captioning	Unverified
vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM	Sep 3, 2022	Decoder, Image Captioning	Unverified
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder	Nov 15, 2023	Decoder, Image Captioning	Unverified
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use	Oct 21, 2024	Image Captioning, Task Planning	Unverified
ViP-CNN: Visual Phrase Guided Convolutional Neural Network	Feb 23, 2017	Descriptive, Image Captioning	Unverified
VisBuddy -- A Smart Wearable Assistant for the Visually Challenged	Aug 17, 2021	Image Captioning, object-detection	Unverified
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models	Feb 14, 2025	Image Captioning, Large Language Model	Unverified
Vision and Language Integration: Moving beyond Objects	Jan 1, 2017	Action Classification, Image Captioning	Unverified
Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction	Feb 28, 2024	Image Captioning, Language Modeling	Unverified
Vision Language Models Can Parse Floor Plan Maps	Sep 19, 2024	Image Captioning, Question Answering	Unverified
Vision-Language Models for Edge Networks: A Comprehensive Survey	Feb 11, 2025	Autonomous Vehicles, Image Captioning	Unverified
Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals	Dec 12, 2024	Image Captioning, Image Generation	Unverified
Vision-to-Language Tasks Based on Attributes and Attention Mechanism	May 29, 2019	Image Captioning, Question Answering	Unverified
Vispi: Automatic Visual Perception and Interpretation of Chest X-rays	Jun 12, 2019	Diagnostic, Image Captioning	Unverified
Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning	Nov 2, 2023	Caption Generation, Efficient Exploration	Unverified
Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences	Jul 31, 2023	Decoder, Image Captioning	Unverified
Visual Classifier Prediction by Distributional Semantic Embedding of Text Descriptions	Sep 1, 2015	Domain Adaptation, Image Captioning	Unverified
Visual Hallucination: Definition, Quantification, and Prescriptive Remediations	Mar 26, 2024	Hallucination, Image Captioning	Unverified
Visual Information Matters for ASR Error Correction	Mar 16, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Visually Guided Spatial Relation Extraction from Text	Jun 1, 2018	Activity Recognition, Image Captioning	Unverified
Visual Question Answering Dataset for Bilingual Image Understanding: A Study of Cross-Lingual Transfer Using Attention Maps	Aug 1, 2018	Cross-Lingual Transfer, Image Captioning	Unverified
Visual representation of negation: Real world data analysis on comic image design	May 21, 2021	Image Captioning, image-classification	Unverified
Visual Transformer for Object Detection	Jun 1, 2022	Image Captioning, Machine Translation	Unverified
ViTOC: Vision Transformer and Object-aware Captioner	Nov 9, 2024	Diversity, Image Captioning	Unverified
VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning	Sep 28, 2020	Image Captioning, Object	Unverified
VLRM: Vision-Language Models act as Reward Models for Image Captioning	Apr 2, 2024	Image Captioning, reinforcement-learning	Unverified
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks	Jul 29, 2024	Deep Learning, Domain Generalization	Unverified
Wasserstein Barycenter Model Ensembling	May 1, 2019	Attribute, General Classification	Unverified
Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset	Dec 1, 2022	Image Captioning, Image Generation	Unverified
Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models	Sep 10, 2020	Caption Generation, Denoising	Unverified
WEmbSim: A Simple yet Effective Metric for Image Captioning	Dec 24, 2020	Image Captioning, Word Embeddings	Unverified
What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning	Oct 31, 2023	Image Captioning, Sentence	Unverified
What Else Would I Like? A User Simulator using Alternatives for Improved Evaluation of Fashion Conversational Recommendation Systems	Jan 11, 2024	Conversational Recommendation, Image Captioning	Unverified
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness	Feb 19, 2025	Image Captioning, Keyword Extraction	Unverified
What is not where: the challenge of integrating spatial representations into deep learning architectures	Jul 21, 2018	Caption Generation, Deep Learning	Unverified
When Radiology Report Generation Meets Knowledge Graph	Feb 19, 2020	Graph Embedding, Image Captioning	Unverified
When to Finish? Optimal Beam Search for Neural Text Generation (modulo beam size)	Aug 31, 2018	Image Captioning, Machine Translation	Unverified
Where to Play: Retrieval of Video Segments using Natural-Language Queries	Jul 2, 2017	Image Captioning, Natural Language Queries	Unverified
“Wikily” Supervised Neural Translation Tailored to Cross-Lingual Tasks	Nov 1, 2021	Cross-Lingual Transfer, Cross-Lingual Word Embeddings	Unverified
WMT 2016 Multimodal Translation System Description based on Bidirectional Recurrent Neural Networks with Double-Embeddings	Aug 1, 2016	Image Captioning, Language Modeling	Unverified
Women also Snowboard: Overcoming Bias in Captioning Models (Extended Abstract)	Jul 2, 2018	Image Captioning	Unverified
Would Deep Generative Models Amplify Bias in Future Models?	Apr 4, 2024	Image Captioning, Image Generation	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified