SOTAVerified

Image Captioning

Image captioning is the task of describing the content of an image in natural language. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.
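To make the evaluation side concrete, here is a minimal sketch of a simplified sentence-level BLEU score in pure Python. This is illustrative only: real caption evaluation uses corpus-level BLEU and CIDEr over multiple reference captions, with smoothing, typically via the standard COCO caption evaluation toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Single reference, no smoothing -- for illustration only."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    # Brevity penalty discourages captions shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A caption identical to its reference scores 1.0; any missing 4-gram drives this unsmoothed variant toward 0, which is why corpus-level aggregation and smoothing are used in practice.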

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers

Showing 451–500 of 1878 papers

Title | Status | Hype
Building Trustworthy Multimodal AI: A Review of Fairness, Transparency, and Ethics in Vision-Language Tasks | | 0
AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions | | 0
Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference | | 0
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions | | 0
AstroLLaVA: towards the unification of astronomical data and natural language | | 0
Impact of Language Guidance: A Reproducibility Study | | 0
How Can Objects Help Video-Language Understanding? | | 0
RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | | 0
MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories | | 0
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention | | 0
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates | | 0
Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity | | 0
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning | | 0
JEEM: Vision-Language Understanding in Four Arabic Dialects | | 0
Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy | | 0
Improved Alignment of Modalities in Large Vision Language Models | | 0
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation | | 0
UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation | Code | 0
Natural Language Generation | | 0
Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic | Code | 0
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | | 0
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing | | 0
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era | | 0
RONA: Pragmatically Diverse Image Captioning with Coherence Relations | Code | 0
Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification | | 0
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models | | 0
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment | | 0
ComicsPAP: understanding comic strips by picking the correct panel | | 0
Measuring directional bias amplification in image captions using predictability | | 0
Improving cognitive diagnostics in pathology: a deep learning approach for augmenting perceptional understanding of histopathology images | | 0
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training | | 0
From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models | | 0
Treble Counterfactual VLMs: A Causal Approach to Hallucination | Code | 0
A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning | Code | 0
Group Relative Policy Optimization for Image Captioning | Code | 0
AC-Lite: A Lightweight Image Captioning Model for Low-Resource Assamese Language | | 0
Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models | | 0
Are Large Language Models Good Data Preprocessors? | | 0
Fine-Grained Video Captioning through Scene Graph Consolidation | | 0
Good Representation, Better Explanation: Role of Convolutional Neural Networks in Transformer-Based Remote Sensing Image Captioning | | 0
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting | | 0
A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models | | 0
InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models | | 0
GroundCap: A Visually Grounded Image Captioning Dataset | | 0
Pretrained Image-Text Models are Secretly Video Captioners | Code | 0
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness | | 0
TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules | | 0
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | | 0
FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning | | 0
Vision-Language Models for Edge Networks: A Comprehensive Survey | | 0
Page 10 of 38

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IBM Research AI | CIDEr | 80.67 | | Unverified
2 | CASIA_IVA | CIDEr | 79.15 | | Unverified
3 | feixiang | CIDEr | 77.31 | | Unverified
4 | wocao | CIDEr | 77.21 | | Unverified
5 | lamiwab172 | CIDEr | 75.93 | | Unverified
6 | RUC_AIM3 | CIDEr | 73.52 | | Unverified
7 | funas | CIDEr | 73.51 | | Unverified
8 | SRC-B_VCLab | CIDEr | 73.47 | | Unverified
9 | sparta | CIDEr | 73.41 | | Unverified
10 | x-viz | CIDEr | 73.26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VALOR | CIDEr | 152.5 | | Unverified
2 | VAST | CIDEr | 149 | | Unverified
3 | Virtex (ResNet-101) | CIDEr | 94 | | Unverified
4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | | Unverified
5 | BLIP-FuseCap | CLIPScore | 78.5 | | Unverified
6 | mPLUG | BLEU-4 | 46.5 | | Unverified
7 | OFA | BLEU-4 | 44.9 | | Unverified
8 | GIT | BLEU-4 | 44.1 | | Unverified
9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | | Unverified
10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 149.1 | | Unverified
2 | GIT2, Single Model | CIDEr | 124.18 | | Unverified
3 | GIT, Single Model | CIDEr | 122.4 | | Unverified
4 | PaLI | CIDEr | 121.09 | | Unverified
5 | CoCa - Google Brain | CIDEr | 117.9 | | Unverified
6 | Microsoft Cognitive Services team | CIDEr | 112.82 | | Unverified
7 | Single Model | CIDEr | 108.98 | | Unverified
8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | | Unverified
9 | FudanFVL | CIDEr | 104.9 | | Unverified
10 | FudanWYZ | CIDEr | 104.25 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GIT2, Single Model | CIDEr | 125.51 | | Unverified
2 | PaLI | CIDEr | 124.35 | | Unverified
3 | GIT, Single Model | CIDEr | 123.92 | | Unverified
4 | CoCa - Google Brain | CIDEr | 120.73 | | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 115.54 | | Unverified
6 | Single Model | CIDEr | 110.76 | | Unverified
7 | FudanFVL | CIDEr | 109.33 | | Unverified
8 | FudanWYZ | CIDEr | 108.04 | | Unverified
9 | IEDA-LAB | CIDEr | 100.15 | | Unverified
10 | firethehole | CIDEr | 99.51 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 126.67 | | Unverified
2 | GIT2, Single Model | CIDEr | 122.27 | | Unverified
3 | GIT, Single Model | CIDEr | 122.04 | | Unverified
4 | CoCa - Google Brain | CIDEr | 121.69 | | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 110.14 | | Unverified
6 | Single Model | CIDEr | 109.49 | | Unverified
7 | FudanFVL | CIDEr | 106.55 | | Unverified
8 | FudanWYZ | CIDEr | 103.75 | | Unverified
9 | Human | CIDEr | 91.62 | | Unverified
10 | firethehole | CIDEr | 88.54 | | Unverified