Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
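To make the BLEU metric mentioned above concrete, here is a minimal sketch of sentence-level BLEU: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It assumes whitespace tokenization, lowercasing, and no smoothing, and the function names `ngrams` and `sentence_bleu` are illustrative only. Benchmark numbers are normally computed at corpus level with a standard evaluation toolkit, so scores from this sketch are not comparable to leaderboard values.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precision with a brevity penalty.

    Assumption: no smoothing, so the score is 0.0 whenever any n-gram
    order has no match in the references.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    refs = ["a dog runs on the beach"]
    print(sentence_bleu("a dog runs on the beach", refs))
    print(sentence_bleu("a dog runs on", refs))
```

The second call shows the brevity penalty at work: every n-gram of the short caption appears in the reference, yet the score drops below 1 because the caption is shorter than the reference.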

Papers

Showing 1651–1675 of 1878 papers

Title	Date	Tasks	Status
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks	Aug 22, 2022	All, Cross-Modal Retrieval	Code Available
DenseCap: Fully Convolutional Localization Networks for Dense Captioning	Nov 24, 2015	Dense Captioning, Image Captioning	Code Available
Human Attention in Image Captioning: Dataset and Analysis	Mar 6, 2019	Image Captioning, Image Description	Code Available
A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding	May 5, 2023	Articles, Image Captioning	Code Available
Multilingual Image Description with Neural Sequence Models	Oct 15, 2015	Image Captioning, Image Description	Code Available
Image2tweet: Datasets in Hindi and English for Generating Tweets from Images	Dec 1, 2021	Image Captioning, World Knowledge	Code Available
Semantic Map-based Generation of Navigation Instructions	Mar 28, 2024	Image Captioning	Code Available
Multimodal Data Augmentation for Image Captioning using Diffusion Models	May 3, 2023	Data Augmentation, Image Captioning	Code Available
CLID: Controlled-Length Image Descriptions with Limited Data	Nov 27, 2022	controllable image captioning, Image Captioning	Code Available
A Hybrid Model for Combining Neural Image Caption and k-Nearest Neighbor Approach for Image Captioning	May 9, 2021	Image Captioning, regression	Code Available
CLDTracker: A Comprehensive Language Description for Visual Tracking	May 29, 2025	Image Captioning, Visual Tracking	Code Available
Class-Conditional self-reward mechanism for improved Text-to-Image models	May 22, 2024	Image Captioning, object-detection	Code Available
ILLUME: Rationalizing Vision-Language Models through Human Interactions	Aug 17, 2022	Image Captioning, Question Answering	Code Available
Semi-Autoregressive Image Captioning	Oct 11, 2021	Decoder, Image Captioning	Code Available
Delete, Retrieve, Generate: A Simple Approach to Sentiment and Style Transfer	Apr 17, 2018	Attribute, Image Captioning	Code Available
Multimodal Learning for Hateful Memes Detection	Nov 25, 2020	Image Captioning, Multimodal Deep Learning	Code Available
Semantic Object Accuracy for Generative Text-to-Image Synthesis	Oct 29, 2019	Image Captioning, Image Generation	Code Available
ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning	Dec 26, 2024	Image Captioning, Retrieval	Code Available
Training for Diversity in Image Paragraph Captioning	Oct 1, 2018	Diversity, Image Captioning	Code Available
Semi-supervised Multimodal Representation Learning through a Global Workspace	Jun 27, 2023	Image Captioning, Image Generation	Code Available
Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity	Sep 7, 2024	Image Captioning, Image Retrieval	Code Available
SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text	May 18, 2018	Descriptive, Image Captioning	Code Available
A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation	Dec 20, 2024	Image Captioning	Code Available
CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation	Jul 16, 2024	controllable image captioning, Data Augmentation	Code Available
ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding	Oct 19, 2023	Image Captioning, Language Modeling	Code Available

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified
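The table above ranks entries scored with three different metrics (CIDEr, CLIPScore, and BLEU-4) in a single list, and those scores are not on a common scale. One way to read it, sketched below, is to group the rows by metric and rank within each group. The rows are copied from the table, with the metric name written as CIDEr; the helper name `rank_by_metric` is hypothetical and not part of any leaderboard tooling.

```python
from collections import defaultdict

# (model, metric, claimed score) copied from the table above.
rows = [
    ("VALOR", "CIDEr", 152.5),
    ("VAST", "CIDEr", 149.0),
    ("Virtex (ResNet-101)", "CIDEr", 94.0),
    ("KOSMOS-1 (1.6B) (zero-shot)", "CIDEr", 84.7),
    ("BLIP-FuseCap", "CLIPScore", 78.5),
    ("mPLUG", "BLEU-4", 46.5),
    ("OFA", "BLEU-4", 44.9),
    ("GIT", "BLEU-4", 44.1),
    ("BLIP-2 ViT-G OPT 2.7B (zero-shot)", "BLEU-4", 43.7),
    ("BLIP-2 ViT-G OPT 6.7B (zero-shot)", "BLEU-4", 43.5),
]


def rank_by_metric(rows):
    """Group rows by metric and sort each group by score, highest first."""
    grouped = defaultdict(list)
    for model, metric, score in rows:
        grouped[metric].append((model, score))
    return {
        metric: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for metric, entries in grouped.items()
    }


if __name__ == "__main__":
    for metric, entries in rank_by_metric(rows).items():
        print(metric)
        for rank, (model, score) in enumerate(entries, start=1):
            print(f"  {rank}. {model}: {score}")
```

Grouped this way, the BLEU-4 entries are compared only against each other instead of being placed below CIDEr scores they cannot be compared with.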

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified