Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.
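As a rough illustration of how a BLEU-style score compares a generated caption with reference captions, the sketch below computes clipped n-gram precision and a brevity penalty for a single caption. It is a minimal, unsmoothed, sentence-level version written for this page; the function names (`ngrams`, `clipped_precision`, `bleu`) and the whitespace tokenization are assumptions, and published benchmark numbers come from established evaluation toolkits with their own tokenization and corpus-level aggregation.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, where each
    n-gram is credited at most as often as it occurs in any one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of the 1..max_n
    clipped precisions, multiplied by a brevity penalty."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    c = len(candidate)
    # Reference length closest to the candidate length (ties: shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical captions, tokenized by whitespace for illustration only.
refs = ["a dog runs on the grass".split(), "a dog is running outside".split()]
print(bleu("a dog runs on the grass".split(), refs))
print(bleu("a dog runs on grass".split(), refs))
```

Because the score is unsmoothed, a caption with no matching 4-gram scores zero, which is one reason BLEU is usually reported at the corpus level. CIDEr is a different metric, based on TF-IDF-weighted n-gram similarity, and is not covered by this sketch.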

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers


Showing 751–800 of 1878 papers

Title	Date	Tasks	Status	Score
Discriminability objective for training descriptive captions	Mar 12, 2018	Caption Generation, Descriptive	Code Available	5
A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning	Mar 6, 2025	Descriptive, Image Captioning	Code Available	5
Iconographic Image Captioning for Artworks	Feb 7, 2021	Image Captioning	Code Available	5
ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding	Oct 19, 2023	Image Captioning, Language Modeling	Code Available	5
ILLUME: Rationalizing Vision-Language Models through Human Interactions	Aug 17, 2022	Image Captioning, Question Answering	Code Available	5
Semantic Object Accuracy for Generative Text-to-Image Synthesis	Oct 29, 2019	Image Captioning, Image Generation	Code Available	5
ICECAP: Information Concentrated Entity-aware Image Captioning	Aug 4, 2021	Articles, Image Captioning	Code Available	5
Geometry Attention Transformer with Position-aware LSTMs for Image Captioning	Oct 1, 2021	Decoder, Image Captioning	Code Available	5
Image2tweet: Datasets in Hindi and English for Generating Tweets from Images	Dec 1, 2021	Image Captioning, World Knowledge	Code Available	5
How Time Matters: Learning Time-Decay Attention for Contextual Spoken Language Understanding in Dialogues	Jun 1, 2018	Dialogue State Tracking, Image Captioning	Code Available	5
BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning	Sep 26, 2023	Image Captioning, Transfer Learning	Code Available	5
HICEScore: A Hierarchical Metric for Image Captioning Evaluation	Jul 26, 2024	Descriptive, Image Captioning	Code Available	5
Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding	Jan 5, 2024	Image Captioning, Machine Translation	Code Available	5
Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage	Aug 14, 2023	Image Captioning, Retrieval	Code Available	5
Aesthetic Image Captioning From Weakly-Labelled Photographs	Aug 29, 2019	Aesthetic Image Captioning, Benchmarking	Code Available	5
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks	Aug 22, 2022	All, Cross-Modal Retrieval	Code Available	5
Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis	Sep 21, 2023	Cross-Modal Retrieval, Image Captioning	Code Available	5
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images	Apr 25, 2015	Image Captioning, Novel Concepts	Code Available	5
Surprisingly Easy Hard-Attention for Sequence to Sequence Learning	Oct 1, 2018	Hard Attention, Image Captioning	Code Available	5
Differentiable Expected BLEU for Text Generation	Sep 27, 2018	Image Captioning, Machine Translation	Unverified	0
DiffCap: Exploring Continuous Diffusion on Image Captioning	May 20, 2023	Caption Generation, Diversity	Unverified	0
Cap2Aug: Caption guided Image to Image data Augmentation	Dec 11, 2022	Classification, Cross-Domain Few-Shot	Unverified	0
Dietary Assessment with Multimodal ChatGPT: A Systematic Analysis	Dec 14, 2023	Image Captioning, Scene Understanding	Unverified	0
Evaluating Text-to-Image Matching using Binary Image Selection (BISON)	Jan 19, 2019	Image Captioning, Image Retrieval	Unverified	0
Dialog Generation Using Multi-Turn Reasoning Neural Networks	Jun 1, 2018	Constituency Parsing, Image Captioning	Unverified	0
Diagnostic Captioning: A Survey	Jan 18, 2021	Diagnostic, Image Captioning	Unverified	0
DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps	Feb 3, 2023	Image Captioning, Optical Character Recognition (OCR)	Unverified	0
Annotation of Online Shopping Images without Labeled Training Examples	Jun 1, 2013	Image Captioning, Object Recognition	Unverified	0
Aesthetic Critiques Generation for Photos	Oct 1, 2017	Image Captioning	Unverified	0
Designing a Robust Radiology Report Generation System	Nov 2, 2024	Decision Making, Diagnostic	Unverified	0
Improving Medical Visual Representations via Radiology Report Generation	Oct 30, 2023	Contrastive Learning, Decoder	Unverified	0
Describing Semantic Representations of Brain Activity Evoked by Visual Stimuli	Jan 19, 2018	Image Captioning, Sentence	Unverified	0
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning	May 24, 2017	Image Captioning, Sentence	Unverified	0
Annotating Modality Expressions and Event Factuality for a Japanese Chess Commentary Corpus	May 1, 2018	Image Captioning, Text Generation	Unverified	0
Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs	Feb 10, 2022	Dense Captioning, Image Captioning	Unverified	0
Bidirectional Awareness Induction in Autoregressive Seq2Seq Models	Aug 25, 2024	Image Captioning, Machine Translation	Unverified	0
Describe Anything in Medical Images	May 9, 2025	Attribute, Diagnostic	Unverified	0
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand	Jan 16, 2022	Image Captioning, Machine Translation	Unverified	0
An Interpretable Model for Scene Graph Generation	Nov 21, 2018	Graph Generation, Image Captioning	Unverified	0
Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2	Aug 9, 2022	Attribute, Image Captioning	Unverified	0
A Comprehensive Analysis of Real-World Image Captioning and Scene Identification	Aug 5, 2023	Descriptive, Image Captioning	Unverified	0
Dependent Multi-Task Learning with Causal Intervention for Image Captioning	May 18, 2021	Image Captioning, Multi-agent Reinforcement Learning	Unverified	0
BFGAN: Backward and Forward Generative Adversarial Networks for Lexically Constrained Sentence Generation	Jun 21, 2018	Image Captioning, Machine Translation	Unverified	0
An Integrated Approach for Video Captioning and Applications	Jan 23, 2022	Image Captioning, Video Captioning	Unverified	0
Dense Image Representation with Spatial Pyramid VLAD Coding of CNN for Locally Robust Captioning	Mar 30, 2016	General Classification, Image Captioning	Unverified	0
Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models	Sep 17, 2021	Caption Generation, Denoising	Unverified	0
An In-depth Walkthrough on Evolution of Neural Machine Translation	Apr 10, 2020	Abstractive Text Summarization, Image Captioning	Unverified	0
DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning	Sep 28, 2024	Hallucination, Image Captioning	Unverified	0
DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis	Oct 1, 2022	COVID-19 Diagnosis, Decoder	Unverified	0
Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis	May 1, 2024	Image Captioning, Question Answering	Unverified	0


Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified