Image Captioning

Image captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.
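The encoder-decoder flow described above can be illustrated with a deliberately tiny sketch. Everything in it is hypothetical and chosen only for illustration: `encode` stands in for a vision backbone, `toy_decoder_step` stands in for a learned language decoder, and the fixed word script replaces a real vocabulary. It shows only the control flow (encode once, then decode token by token until an end marker), not any published model.

```python
from typing import Callable, List, Sequence

BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers


def encode(image: Sequence[Sequence[float]]) -> List[float]:
    """Toy encoder: collapse a 2-D grid of pixel intensities into per-row means."""
    return [sum(row) / len(row) for row in image]


def toy_decoder_step(features: List[float], prefix: List[str]) -> str:
    """Toy decoder step: choose the next word from the features and the prefix.

    A real decoder would score a whole vocabulary; this stub follows a fixed
    script whose second word depends on the encoded brightness.
    """
    brightness = sum(features) / len(features)
    script = ["a", "bright" if brightness > 0.5 else "dark", "image", EOS]
    # prefix[0] is BOS, so len(prefix) - 1 words have been emitted so far.
    return script[min(len(prefix) - 1, len(script) - 1)]


def greedy_caption(
    image: Sequence[Sequence[float]],
    step: Callable[[List[float], List[str]], str] = toy_decoder_step,
    max_len: int = 10,
) -> str:
    """Encode the image once, then decode one word at a time until EOS."""
    features = encode(image)
    tokens = [BOS]
    for _ in range(max_len):
        word = step(features, tokens)
        if word == EOS:
            break
        tokens.append(word)
    return " ".join(tokens[1:])


print(greedy_caption([[0.9, 0.8], [0.7, 1.0]]))
```

In a real system the `step` callable would be replaced by a trained network, and greedy selection is often swapped for beam search.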

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
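To make the evaluation step more concrete, here is a minimal sketch of a sentence-level BLEU score for a single caption, assuming whitespace-tokenized text and no smoothing. The function names are illustrative rather than taken from any library, and published BLEU numbers are normally computed at corpus level with a standard toolkit, so values from this sketch are not comparable to the leaderboard figures on this page.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str],
                       references: List[Sequence[str]], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the reference that contains it most often."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def sentence_bleu(candidate: Sequence[str],
                  references: List[Sequence[str]], max_n: int = 4) -> float:
    """Geometric mean of 1..max_n clipped precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any missing n-gram order gives a zero score
    # Use the reference length closest to the candidate length (ties: shorter).
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


caption = "a dog runs on the grass".split()
refs = ["a dog runs on the grass".split(), "a dog is running outside".split()]
print(sentence_bleu(caption, refs))
```

The clipping in `modified_precision` is what stops a degenerate caption such as "the the the the" from scoring well just by repeating a common word.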

Papers

Showing 751–800 of 1878 papers

Title	Date	Tasks	Status
AstroLLaVA: towards the unification of astronomical data and natural language	Apr 11, 2025	Astronomy, Image Captioning	Unverified
Generative Distribution Prediction: A Unified Approach to Multimodal Learning	Feb 10, 2025	Domain Adaptation, Image Captioning	Unverified
3D Spatial Understanding in MLLMs: Disambiguation and Evaluation	Dec 9, 2024	3D dense captioning, 3D visual grounding	Unverified
Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis	May 24, 2023	Image Captioning, Natural Language Understanding	Unverified
Exploring the Functional and Geometric Bias of Spatial Relations Using Neural Language Models	Jun 1, 2018	Image Captioning	Unverified
CLAMP: Contrastive LAnguage Model Prompt-tuning	Dec 4, 2023	Contrastive Learning, Image Captioning	Unverified
Consensus Graph Representation Learning for Better Grounded Image Captioning	Dec 2, 2021	Graph Representation Learning, Hallucination	Unverified
Attr2Style: A Transfer Learning Approach for Inferring Fashion Styles via Apparel Attributes	Aug 26, 2020	Attribute, Image Captioning	Unverified
Geometry-Entangled Visual Semantic Transformer for Image Captioning	Sep 29, 2021	Caption Generation, Image Captioning	Unverified
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing	Jan 12, 2025	Image Captioning, Language Modeling	Unverified
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing	Mar 16, 2025	Change Detection, Image Captioning	Unverified
GeoSeq2Seq: Information Geometric Sequence-to-Sequence Networks	Oct 25, 2017	Image Captioning, Translation	Unverified
Image captioning in different languages	May 31, 2024	Image Captioning, Position	Unverified
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions	Feb 20, 2024	Image Captioning, Question Answering	Unverified
Exploring Spatial Language Grounding Through Referring Expressions	Feb 4, 2025	Image Captioning, Negation	Unverified
Exploring Semantic Relationships for Unpaired Image Captioning	Jun 20, 2021	Image Captioning, Sentence	Unverified
Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style	Oct 15, 2019	Decoder, Image Captioning	Unverified
CLAIR: Evaluating Image Captions with Large Language Models	Oct 19, 2023	Diversity, Image Captioning	Unverified
Context-Aware Group Captioning via Self-Attention and Contrastive Features	Apr 7, 2020	Image Captioning	Unverified
Improving mitosis detection on histopathology images using large vision-language models	Oct 11, 2023	Domain Generalization, Image Captioning	Unverified
Good Representation, Better Explanation: Role of Convolutional Neural Networks in Transformer-Based Remote Sensing Image Captioning	Feb 22, 2025	Decoder, Image Captioning	Unverified
Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks	Sep 29, 2021	Edge-computing, Face Detection	Unverified
Image Captioning in news report scenario	Mar 24, 2024	Image Captioning, Recommendation Systems	Unverified
Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity	Mar 31, 2025	Image Captioning, Optical Character Recognition	Unverified
Contextual Emotion Estimation from Image Captions	Sep 22, 2023	Image Captioning, Language Modelling	Unverified
A Unified Sequence Interface for Vision Tasks	Jun 15, 2022	Image Captioning, Instance Segmentation	Unverified
Graph Neural Networks in Vision-Language Image Understanding: A Survey	Mar 7, 2023	Image Captioning, Image Retrieval	Unverified
Image Captioning using Multiple Transformers for Self-Attention Mechanism	Feb 14, 2021	Image Captioning	Unverified
GraphSeq2Seq: Graph-Sequence-to-Sequence for Neural Machine Translation	Sep 27, 2018	Decoder, Image Captioning	Unverified
Green Runner: A tool for efficient model selection from model repositories	May 26, 2023	Deep Learning, Image Captioning	Unverified
Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models	May 20, 2025	Hallucination, Image Captioning	Unverified
GroundCap: A Visually Grounded Image Captioning Dataset	Feb 19, 2025	Image Captioning, Object Detection	Unverified
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment	Mar 12, 2025	Contrastive Learning, Cross-Modal Retrieval	Unverified
Image Captioning based on Deep Reinforcement Learning	Sep 13, 2018	Deep Reinforcement Learning, Image Captioning	Unverified
Exploring External Knowledge for Accurate modeling of Visual and Language Problems	Jan 27, 2023	Image Captioning, Machine Translation	Unverified
Group-based Distinctive Image Captioning with Memory Attention	Aug 20, 2021	Contrastive Learning, Image Captioning	Unverified
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention	Apr 3, 2025	Caption Generation, Contrastive Learning	Unverified
GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints	Jun 1, 2018	Diversity, Image Captioning	Unverified
AutoCaption: Image Captioning with Neural Architecture Search	Dec 16, 2020	Decoder, Image Captioning	Unverified
Grow and Prune Compact, Fast, and Accurate LSTMs	May 30, 2018	Image Captioning, speech-recognition	Unverified
Exploring Explicit and Implicit Visual Relationships for Image Captioning	May 6, 2021	Decoder, Image Captioning	Unverified
Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language	Jun 28, 2024	Image Captioning	Unverified
Guide Me: Interacting with Deep Networks	Mar 30, 2018	Image Captioning, Image Generation	Unverified
Guiding Attention using Partial-Order Relationships for Image Captioning	Apr 15, 2022	Caption Generation, Image Captioning	Unverified
CIC: A Framework for Culturally-Aware Image Captioning	Feb 8, 2024	Descriptive, Image Captioning	Unverified
Image Captioning based on Feature Refinement and Reflective Decoding	Jun 16, 2022	Decoder, Image Captioning	Unverified
Chittron: An Automatic Bangla Image Captioning System	Sep 2, 2018	Caption Generation, Image Captioning	Unverified
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning	May 25, 2023	Caption Generation, Decoder	Unverified
Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models	Feb 24, 2025	Hallucination, Image Captioning	Unverified
Cheap-fake Detection with LLM using Prompt Engineering	Jun 5, 2023	Image Captioning, Image Generation	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified