Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.
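To make the BLEU metric mentioned above more concrete, here is a minimal sketch of a sentence-level BLEU score in plain Python. It is an illustration under simplifying assumptions, not the evaluation code used by any benchmark listed on this page: captions are assumed to be pre-tokenized by whitespace, no smoothing is applied, and the function names (`ngrams`, `clipped_precision`, `sentence_bleu`) are hypothetical. Leaderboard numbers are usually computed at the corpus level with a standardized tokenizer, so scores from this sketch are not directly comparable to them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    by the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    # Brevity penalty uses the reference length closest to the candidate length
    # (ties resolved in favor of the shorter reference).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


candidate = "a dog runs on the grass".split()
references = ["a dog runs across the grass".split()]
print(round(sentence_bleu(candidate, references, max_n=2), 3))  # 0.707
```

In this example the unigram precision is 5/6 and the bigram precision is 3/5, so the BLEU-2 score is sqrt(0.5). With the default `max_n=4` the same pair scores 0, because no 4-gram matches and the sketch applies no smoothing. That is one reason sentence-level BLEU is usually smoothed or replaced by a corpus-level computation.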

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers

Showing 1201–1225 of 1878 papers

Title	Date	Tasks	Status
Are metrics measuring what they should? An evaluation of image captioning task metrics	Jul 4, 2022	Image Captioning	Unverified
A Review of Multi-Modal Large Language and Vision Models	Mar 28, 2024	Image Captioning, Prompt Engineering	Unverified
ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding	May 9, 2025	Image Captioning, Object Recognition	Unverified
A Scaled Encoder Decoder Network for Image Captioning in Hindi	Dec 1, 2021	Decoder, Deep Learning	Unverified
A Self-Boosting Framework for Automated Radiographic Report Generation	Jun 19, 2021	Image Captioning, Image-text matching	Unverified
A Self-Explainable Stylish Image Captioning Framework via Multi-References	Oct 20, 2021	Image Captioning	Unverified
A Self-Guided Framework for Radiology Report Generation	Jun 19, 2022	Image Captioning, Medical Report Generation	Unverified
A sequential guiding network with attention for image captioning	Nov 1, 2018	Decoder, Image Captioning	Unverified
As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?	Mar 19, 2024	Adversarial Attack, Image Captioning	Unverified
Assessing Image Quality Issues for Real-World Problems	Mar 27, 2020	Image Captioning, Question Answering	Unverified
Assisting Scene Graph Generation with Self-Supervision	Aug 8, 2020	Graph Generation, Image Captioning	Unverified
Assistive Image Annotation Systems with Deep Learning and Natural Language Capabilities: A Review	Jun 28, 2024	Active Learning, Image Captioning	Unverified
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment	Mar 12, 2025	Contrastive Learning, Cross-Modal Retrieval	Unverified
AstroLLaVA: towards the unification of astronomical data and natural language	Apr 11, 2025	Astronomy, Image Captioning	Unverified
A Survey of Evaluation Metrics Used for NLG Systems	Aug 27, 2020	Image Captioning, nlg evaluation	Unverified
A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation	Jun 12, 2023	Image Captioning, Machine Translation	Unverified
A survey on knowledge-enhanced multimodal learning	Nov 19, 2022	Conditional Image Generation, Factual Visual Question Answering	Unverified
A Survey on Large Language Models from Concept to Implementation	Mar 27, 2024	Chatbot, Image Captioning	Unverified
Asynchronous Evolution of Deep Neural Network Architectures	Aug 8, 2023	Evolutionary Algorithms, Image Captioning	Unverified
A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning	Sep 27, 2024	Decoder, Fairness	Unverified
A Thorough Review on Recent Deep Learning Methodologies for Image Captioning	Jul 28, 2021	Caption Generation, Descriptive	Unverified
A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)	May 2, 2024	Acoustic Scene Classification, Event Detection	Unverified
Attend More Times for Image Captioning	Dec 8, 2018	Image Captioning	Unverified
Attention-based Multimodal Neural Machine Translation	Aug 1, 2016	Image Captioning, Machine Translation	Unverified
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation	Jun 3, 2025	Caption Generation, Image Captioning	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified