Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers


Showing 1151–1175 of 1878 papers

Title	Date	Tasks	Status
Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning	Nov 1, 2021	Image Captioning, Retrieval	Unverified
Retrieval-Augmented Multimodal Language Modeling	Nov 22, 2022	Caption Generation, Image Captioning	Unverified
Retrieval-Augmented Transformer for Image Captioning	Jul 26, 2022	Image Captioning, Retrieval	Unverified
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation	Mar 25, 2025	Image Captioning, Image Generation	Unverified
Review Networks for Caption Generation	May 25, 2016	Caption Generation, Decoder	Unverified
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning	Feb 9, 2023	Few-Shot Learning, Image Captioning	Unverified
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting	Feb 20, 2025	Image Captioning, multimodal interaction	Unverified
Revisiting Bayes by Backprop	Jan 1, 2018	Image Captioning, Language Modelling	Unverified
Rich Image Captioning in the Wild	Mar 30, 2016	Image Captioning	Unverified
Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation	Dec 15, 2020	Image Captioning, Robot Navigation	Unverified
Robust Cross-Modal Representation Learning with Progressive Self-Distillation	Apr 10, 2022	Contrastive Learning, Image Captioning	Unverified
Robust Image Captioning	Dec 6, 2020	Image Captioning, Text Generation	Unverified
RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model	Sep 3, 2023	Decision Making, Image Captioning	Unverified
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering	Nov 3, 2024	Descriptive, Image Captioning	Unverified
RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model	Apr 7, 2025	Image Captioning, image-classification	Unverified
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image Captioning, Image-text Retrieval	Unverified
Contextually Plausible and Diverse 3D Human Motion Prediction	Dec 18, 2019	Diversity, Human motion prediction	Unverified
SANVis: Visual Analytics for Understanding Self-Attention Networks	Sep 13, 2019	Image Captioning, Machine Translation	Unverified
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping	May 19, 2025	Contrastive Learning, Cross-Modal Retrieval	Unverified
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data	Apr 4, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Scaling Up Vision-Language Pre-training for Image Captioning	Nov 24, 2021	Attribute, Image Captioning	Unverified
Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention	Apr 12, 2016	Handwriting Recognition, Image Captioning	Unverified
Scene-based Factored Attention for Image Captioning	Aug 7, 2019	Caption Generation, Decoder	Unverified
Scene Graph Generation for Better Image Captioning?	Sep 23, 2021	Caption Generation, Graph Generation	Unverified
Scene Graph Generation with Geometric Context	Nov 25, 2021	Activity Recognition, Graph Generation	Unverified



Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified