Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr.
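To make the BLEU metric mentioned above concrete, here is a minimal sketch of unsmoothed sentence-level BLEU in plain Python. The function names (`ngrams`, `modified_precision`, `sentence_bleu`) are illustrative and do not come from any particular library. Leaderboards usually report corpus-level BLEU-4, which pools n-gram counts over the whole test set and is often scaled by 100, so the numbers from this sketch are not directly comparable with the tables on this page.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    # Brevity penalty uses the reference length closest to the candidate's
    # (ties go to the shorter reference).
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    refs = ["a dog runs on the grass".split(),
            "a dog is running across a lawn".split()]
    print(sentence_bleu("a dog runs on the lawn".split(), refs))
```

Clipping is what stops a degenerate caption such as "the the the the" from scoring well on unigram precision, and the brevity penalty keeps very short captions from being rewarded for precision alone.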

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers

Showing 51–100 of 1878 papers

Title	Date	Tasks	Status	Hype
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging	Apr 14, 2025	Anomaly Detection, Diagnostic	Code Available	1
A Survey on Efficient Vision-Language Models	Apr 13, 2025	Image Captioning, Question Answering	Code Available	1
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions	Apr 13, 2025	Image Captioning, TAG	Unverified	0
Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference	Apr 13, 2025	Bayesian Inference, Image Captioning	Unverified	0
AstroLLaVA: towards the unification of astronomical data and natural language	Apr 11, 2025	Astronomy, Image Captioning	Unverified	0
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions	Apr 11, 2025	Contrastive Learning, Image Captioning	Unverified	0
Impact of Language Guidance: A Reproducibility Study	Apr 10, 2025	Contrastive Learning, Image Captioning	Unverified	0
How Can Objects Help Video-Language Understanding?	Apr 10, 2025	Image Captioning, Object	Unverified	0
OmniCaptioner: One Captioner to Rule Them All	Apr 9, 2025	All, Image Captioning	Code Available	2
RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model	Apr 7, 2025	Image Captioning, image-classification	Unverified	0
MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories	Apr 4, 2025	Decision Making, Image Captioning	Unverified	0
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention	Apr 3, 2025	Caption Generation, Contrastive Learning	Unverified	0
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates	Apr 1, 2025	Image Captioning	Unverified	0
Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity	Mar 31, 2025	Image Captioning, Optical Character Recognition	Unverified	0
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning	Mar 30, 2025	Graph Attention, Image Captioning	Unverified	0
JEEM: Vision-Language Understanding in Four Arabic Dialects	Mar 27, 2025	Image Captioning, Question Answering	Unverified	0
Unified Multimodal Discrete Diffusion	Mar 26, 2025	Image Captioning, Image Generation	Code Available	2
Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy	Mar 26, 2025	Hallucination, Image Captioning	Unverified	0
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models	Mar 25, 2025	Benchmarking, Image Captioning	Code Available	1
Improved Alignment of Modalities in Large Vision Language Models	Mar 25, 2025	GPU, Image Captioning	Unverified	0
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation	Mar 25, 2025	Image Captioning, Image Generation	Unverified	0
Natural Language Generation	Mar 20, 2025	Image Captioning, Image to text	Unverified	0
UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation	Mar 20, 2025	Image Captioning, Transfer Learning	Code Available	0
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives	Mar 18, 2025	Image Captioning	Code Available	1
Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic	Mar 18, 2025	General Knowledge, Image Captioning	Code Available	0
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens	Mar 17, 2025	Image Captioning, Image Generation	Unverified	0
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era	Mar 16, 2025	Benchmarking, Image Captioning	Unverified	0
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing	Mar 16, 2025	Change Detection, Image Captioning	Unverified	0
Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition	Mar 16, 2025	Caption Generation, Image Captioning	Code Available	1
Falcon: A Remote Sensing Vision-Language Foundation Model	Mar 14, 2025	Image Captioning, image-classification	Code Available	3
RONA: Pragmatically Diverse Image Captioning with Coherence Relations	Mar 14, 2025	Diversity, Image Captioning	Code Available	0
Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification	Mar 13, 2025	Image Captioning, RAG	Unverified	0
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models	Mar 12, 2025	Cross-Lingual Transfer, Image Captioning	Unverified	0
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment	Mar 12, 2025	Contrastive Learning, Cross-Modal Retrieval	Unverified	0
ComicsPAP: understanding comic strips by picking the correct panel	Mar 11, 2025	Image Captioning, Visual Question Answering (VQA)	Unverified	0
Measuring directional bias amplification in image captions using predictability	Mar 10, 2025	Image Captioning, image-classification	Unverified	0
Improving cognitive diagnostics in pathology: a deep learning approach for augmenting perceptional understanding of histopathology images	Mar 10, 2025	Diagnostic, Image Captioning	Unverified	0
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training	Mar 9, 2025	Hallucination, Image Captioning	Unverified	0
From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models	Mar 8, 2025	Image Captioning, Language Modeling	Unverified	0
Treble Counterfactual VLMs: A Causal Approach to Hallucination	Mar 8, 2025	Autonomous Driving, counterfactual	Code Available	0
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model	Mar 6, 2025	General Knowledge, Image Captioning	Code Available	2
A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning	Mar 6, 2025	Descriptive, Image Captioning	Code Available	0
AC-Lite : A Lightweight Image Captioning Model for Low-Resource Assamese Language	Mar 3, 2025	Decoder, Image Captioning	Unverified	0
Group Relative Policy Optimization for Image Captioning	Mar 3, 2025	Diversity, Image Captioning	Code Available	0
Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models	Feb 24, 2025	Hallucination, Image Captioning	Unverified	0
Are Large Language Models Good Data Preprocessors?	Feb 24, 2025	Image Captioning	Unverified	0
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts	Feb 24, 2025	Benchmarking, Fact Verification	Code Available	2
Fine-Grained Video Captioning through Scene Graph Consolidation	Feb 23, 2025	Caption Generation, Image Captioning	Unverified	0
Good Representation, Better Explanation: Role of Convolutional Neural Networks in Transformer-Based Remote Sensing Image Captioning	Feb 22, 2025	Decoder, Image Captioning	Unverified	0
Weakly Supervised Video Scene Graph Generation via Natural Language Supervision	Feb 21, 2025	Graph Generation, Image Captioning	Code Available	1

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified