Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.
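To make the BLEU metric mentioned above concrete, here is a minimal sketch of an unsmoothed, sentence-level BLEU score in plain Python. The function names (`ngrams`, `modified_precision`, `sentence_bleu`) and the whitespace tokenization are illustrative assumptions, not the evaluation code used by any benchmark listed here; official evaluations typically compute BLEU at corpus level with standardized tokenization and report it scaled by 100.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Geometric mean of clipped 1..max_n-gram precisions times a brevity penalty.
    Unsmoothed: returns 0.0 if any n-gram order has no match."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate length
    # (ties broken in favor of the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


# Hypothetical captions, tokenized by whitespace for illustration only.
candidate = "a dog runs on the grass".split()
references = ["a dog runs on the grass".split(),
              "a dog is running across a lawn".split()]
print(sentence_bleu(candidate, references))  # prints 1.0 (exact match with one reference)
```

Because the score is unsmoothed, a short caption with no matching 4-gram scores 0.0, which is one reason BLEU is usually aggregated over a whole test set rather than read per caption.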

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers

Showing 1151–1200 of 1878 papers

Title	Date	Tasks	Status
Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning	Nov 1, 2021	Image Captioning, Retrieval	Unverified
Retrieval-Augmented Multimodal Language Modeling	Nov 22, 2022	Caption Generation, Image Captioning	Unverified
Retrieval-Augmented Transformer for Image Captioning	Jul 26, 2022	Image Captioning, Retrieval	Unverified
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation	Mar 25, 2025	Image Captioning, Image Generation	Unverified
Review Networks for Caption Generation	May 25, 2016	Caption Generation, Decoder	Unverified
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning	Feb 9, 2023	Few-Shot Learning, Image Captioning	Unverified
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting	Feb 20, 2025	Image Captioning, multimodal interaction	Unverified
Revisiting Bayes by Backprop	Jan 1, 2018	Image Captioning, Language Modelling	Unverified
Rich Image Captioning in the Wild	Mar 30, 2016	Image Captioning	Unverified
Robots Understanding Contextual Information in Human-Centered Environments using Weakly Supervised Mask Data Distillation	Dec 15, 2020	Image Captioning, Robot Navigation	Unverified
Robust Cross-Modal Representation Learning with Progressive Self-Distillation	Apr 10, 2022	Contrastive Learning, Image Captioning	Unverified
Robust Image Captioning	Dec 6, 2020	Image Captioning, Text Generation	Unverified
RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model	Sep 3, 2023	Decision Making, Image Captioning	Unverified
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering	Nov 3, 2024	Descriptive, Image Captioning	Unverified
RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model	Apr 7, 2025	Image Captioning, image-classification	Unverified
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image Captioning, Image-text Retrieval	Unverified
Contextually Plausible and Diverse 3D Human Motion Prediction	Dec 18, 2019	Diversity, Human motion prediction	Unverified
SANVis: Visual Analytics for Understanding Self-Attention Networks	Sep 13, 2019	Image Captioning, Machine Translation	Unverified
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping	May 19, 2025	Contrastive Learning, Cross-Modal Retrieval	Unverified
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data	Apr 4, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Scaling Up Vision-Language Pre-training for Image Captioning	Nov 24, 2021	Attribute, Image Captioning	Unverified
Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention	Apr 12, 2016	Handwriting Recognition, Image Captioning	Unverified
Scene-based Factored Attention for Image Captioning	Aug 7, 2019	Caption Generation, Decoder	Unverified
Scene Graph Generation for Better Image Captioning?	Sep 23, 2021	Caption Generation, Graph Generation	Unverified
Scene Graph Generation with Geometric Context	Nov 25, 2021	Activity Recognition, Graph Generation	Unverified
A Comprehensive Survey of Scene Graphs: Generation and Application	Mar 17, 2021	Image Captioning, Question Answering	Unverified
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling	Feb 1, 2024	Diversity, Image Captioning	Unverified
SD-RSIC: Summarization Driven Deep Remote Sensing Image Captioning	Jun 15, 2020	Image Captioning	Unverified
Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models	Nov 7, 2024	Adversarial Attack, Image Captioning	Unverified
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models	Dec 11, 2024	Image Captioning, Image Generation	Unverified
Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels	Dec 22, 2015	Image Captioning, image-classification	Unverified
Seeing with Humans: Gaze-Assisted Neural Image Captioning	Aug 18, 2016	Image Captioning, Object	Unverified
See Your Heart: Psychological states Interpretation through Visual Creations	Feb 11, 2023	Emotion Classification, Image Captioning	Unverified
Self-Adaptive Scaling for Learnable Residual Structure	Nov 1, 2019	de-en, Image Captioning	Unverified
Self-Annotated Training for Controllable Image Captioning	Oct 16, 2021	controllable image captioning, Image Captioning	Unverified
Self-critical n-step Training for Image Captioning	Apr 15, 2019	Image Captioning, Reinforcement Learning	Unverified
Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning	Sep 15, 2017	Image Captioning, Sentence	Unverified
Semantically Invariant Text-to-Image Generation	Sep 27, 2018	Image Captioning, Image Generation	Unverified
Semantic and Expressive Variations in Image Captions Across Languages	Jan 1, 2025	Descriptive, Image Captioning	Unverified
Semantic-aware Image Deblurring	Oct 9, 2019	Deblurring, Image Captioning	Unverified
Semantic Composition in Visually Grounded Language Models	May 15, 2023	Image Captioning, Inductive Bias	Unverified
Semantic Distillation Guided Salient Object Detection	Mar 8, 2022	Image Captioning, Object	Unverified
Semantic Exploration from Language Abstractions and Pretrained Representations	Apr 8, 2022	Image Captioning, Reinforcement Learning (RL)	Unverified
Semantic Regularisation for Recurrent Image Annotation	Nov 16, 2016	General Classification, Image Captioning	Unverified
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning	Mar 30, 2025	Graph Attention, Image Captioning	Unverified
Semantic Tuples for Evaluation of Image to Sentence Generation	Sep 1, 2015	Image Captioning, Machine Translation	Unverified
SemEval-2016 Task 2: Interpretable Semantic Textual Similarity	Jun 1, 2016	Image Captioning, Semantic Textual Similarity	Unverified
Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data	Jan 26, 2023	Image Captioning, Relational Captioning	Unverified
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching	Mar 26, 2024	Data Augmentation, Graph Matching	Unverified
Self-Supervised Image Captioning with CLIP	Jun 26, 2023	Image Captioning, Informativeness	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified