Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
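To make the BLEU metric mentioned above more concrete, the following is a minimal sketch of its core idea: clipped n-gram precision against a set of reference captions, combined with a brevity penalty. It is a simplified, sentence-level illustration under stated assumptions (whitespace tokenization, no smoothing, uniform weights over 1- to 4-grams) and not the scoring code used by any benchmark listed here. The function names `ngrams`, `clipped_precision`, and `sentence_bleu` are hypothetical helpers chosen for this example. CIDEr, which additionally weights n-grams by TF-IDF, is not covered.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that appear in some reference.

    Each n-gram's count is clipped to the maximum number of times it
    occurs in any single reference, so repeating a word cannot inflate
    the score.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights (a sketch)."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty * geo_mean


candidate = "a dog runs on the grass".split()
references = [
    "a dog runs on the green grass".split(),
    "the dog is running on grass".split(),
]
print(clipped_precision(candidate, references, 1))  # 1.0
print(clipped_precision(candidate, references, 2))  # 0.8
```

Reported BLEU-4 scores are normally computed at corpus level with a standard evaluation toolkit, so values from this sketch are not directly comparable to leaderboard numbers.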

Papers


Showing 401–450 of 1878 papers

Title	Date	Tasks	Status	Hype
Towards Accurate Text-based Image Captioning with Content Diversity Exploration	Apr 23, 2021	Caption Generation, Diversity	Code Available	1
Towards Local Visual Modeling for Image Captioning	Feb 13, 2023	Image Captioning, Object Recognition	Code Available	1
Learning to Generate Grounded Visual Captions without Localization Supervision	Jun 1, 2019	Image Captioning, Language Modelling	Code Available	1
IC3: Image Captioning by Committee Consensus	Feb 2, 2023	Image Captioning	Code Available	1
MemCap: Memorizing Style Knowledge for Image Captioning	Apr 3, 2020	Image Captioning, Language Modeling	Code Available	1
Convolutional Image Captioning	Nov 24, 2017	Image Captioning, Text Generation	Code Available	1
In Defense of Scene Graphs for Image Captioning	Feb 9, 2021	Human-Object Interaction Detection, Image Captioning	Code Available	1
Evaluating Text-to-Image Matching using Binary Image Selection (BISON)	Jan 19, 2019	Image Captioning, Image Retrieval	Unverified	0
Aesthetic Critiques Generation for Photos	Oct 1, 2017	Image Captioning	Unverified	0
Annotation of Online Shopping Images without Labeled Training Examples	Jun 1, 2013	Image Captioning, Object Recognition	Unverified	0
A Comprehensive Analysis of Real-World Image Captioning and Scene Identification	Aug 5, 2023	Descriptive, Image Captioning	Unverified	0
DS@BioMed at ImageCLEFmedical Caption 2024: Enhanced Attention Mechanisms in Medical Caption Generation through Concept Detection Integration	Jun 1, 2024	Caption Generation, Image Captioning	Unverified	0
Improving Medical Visual Representations via Radiology Report Generation	Oct 30, 2023	Contrastive Learning, Decoder	Unverified	0
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning	May 24, 2017	Image Captioning, Sentence	Unverified	0
Annotating Modality Expressions and Event Factuality for a Japanese Chess Commentary Corpus	May 1, 2018	Image Captioning, Text Generation	Unverified	0
Bidirectional Awareness Induction in Autoregressive Seq2Seq Models	Aug 25, 2024	Image Captioning, Machine Translation	Unverified	0
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand	Jan 16, 2022	Image Captioning, Machine Translation	Unverified	0
An Interpretable Model for Scene Graph Generation	Nov 21, 2018	Graph Generation, Image Captioning	Unverified	0
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2	Aug 9, 2022	Attribute, Image Captioning	Unverified	0
Dual Attention on Pyramid Feature Maps for Image Captioning	Nov 2, 2020	Descriptive, Image Captioning	Unverified	0
BFGAN: Backward and Forward Generative Adversarial Networks for Lexically Constrained Sentence Generation	Jun 21, 2018	Image Captioning, Machine Translation	Unverified	0
An Integrated Approach for Video Captioning and Applications	Jan 23, 2022	Image Captioning, Video Captioning	Unverified	0
An In-depth Walkthrough on Evolution of Neural Machine Translation	Apr 10, 2020	Abstractive Text Summarization, Image Captioning	Unverified	0
DRAMA: Joint Risk Localization and Captioning in Driving	Sep 22, 2022	Image Captioning	Unverified	0
Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis	May 1, 2024	Image Captioning, Question Answering	Unverified	0
An Image captioning algorithm based on the Hybrid Deep Learning Technique (CNN+GRU)	Jan 6, 2023	Decoder, Image Captioning	Unverified	0
A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation	Oct 11, 2023	Caption Generation, Decoder	Unverified	0
Beyond Holistic Object Recognition: Enriching Image Understanding with Part States	Dec 15, 2016	Human-Object Interaction Detection, Image Captioning	Unverified	0
Masked Non-Autoregressive Image Captioning	Jun 3, 2019	Decoder, Diversity	Unverified	0
Dropout during inference as a model for neurological degeneration in an image captioning network	Aug 11, 2018	Image Captioning	Unverified	0
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions	Apr 13, 2025	Image Captioning, TAG	Unverified	0
Beyond Caption To Narrative: Video Captioning With Multiple Sentences	May 18, 2016	Action Localization, Image Captioning	Unverified	0
A Neural-Symbolic Approach to Design of CAPTCHA	Oct 29, 2017	BIG-bench Machine Learning, Image Captioning	Unverified	0
Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators	Sep 22, 2019	Image Captioning, Image-text matching	Unverified	0
Better Understanding Hierarchical Visual Relationship for Image Caption	Dec 4, 2019	Decoder, Image Captioning	Unverified	0
Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring	Jun 10, 2025	Image Captioning	Unverified	0
A Neural Compositional Paradigm for Image Captioning	Sep 24, 2018	Diversity, Image Captioning	Unverified	0
Better Captioning with Sequence-Level Exploration	Mar 8, 2020	Image Captioning	Unverified	0
ADVISE: Symbolism and External Knowledge for Decoding Advertisements	Nov 17, 2017	Clustering, Image Captioning	Unverified	0
A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models	Feb 19, 2025	Image Captioning, Language Modeling	Unverified	0
Downstream-Pretext Domain Knowledge Traceback for Active Learning	Jul 20, 2024	Active Learning, Diversity	Unverified	0
A Neural Approach to Pun Generation	Jul 1, 2018	Diversity, Image Captioning	Unverified	0
Adversarial Semantic Alignment for Improved Image Captions	Jun 1, 2019	Image Captioning	Unverified	0
Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm	Feb 11, 2022	Image Captioning, Multi-Task Learning	Unverified	0
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology	Sep 21, 2024	Benchmarking, Depth Estimation	Unverified	0
An Ensemble Model with Attention Based Mechanism for Image Captioning	Jan 22, 2025	Ensemble Learning, Image Captioning	Unverified	0
Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models	Nov 25, 2024	Attribute, Computational Efficiency	Unverified	0
Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need	Dec 6, 2022	Image Captioning, Information Retrieval	Unverified	0
Dataset Augmentation by Mixing Visual Concepts	Dec 19, 2024	Image Captioning	Unverified	0
Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model	May 29, 2025	Image Captioning, Language Modeling	Unverified	0


Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified