Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
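As a rough illustration of the n-gram overlap idea behind BLEU, the sketch below computes an unsmoothed sentence-level score from a tokenized candidate caption and one or more reference captions. The function names (`ngrams`, `clipped_precision`, `sentence_bleu`) are hypothetical and chosen for this example only. Benchmark results are normally reported with corpus-level BLEU from a standard evaluation toolkit, so this should be read as a minimal sketch and not as the scoring code behind any leaderboard. CIDEr is built differently (it compares TF-IDF-weighted n-gram vectors) and is not covered here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the largest count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n.

    Assumption for this sketch: no smoothing, so any zero n-gram
    precision yields a score of 0.0.
    """
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Brevity penalty uses the reference length closest to the candidate
    # length (ties broken toward the shorter reference).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return bp * math.exp(log_mean)


if __name__ == "__main__":
    refs = ["a dog runs on the grass".split(),
            "a dog is running across a lawn".split()]
    cand = "a dog runs on the lawn".split()
    print(round(sentence_bleu(cand, refs), 3))
```

Clipping is what stops a degenerate caption such as "the the the the" from scoring well: repeated words are only credited as often as they appear in a reference.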

Papers


Showing 401–450 of 1878 papers

Title	Date	Tasks	Status	Hype
Order-Embeddings of Images and Language	Nov 19, 2015	Cross-Modal Retrieval, Image Captioning	Code Available	1
How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?	Nov 16, 2015	Image Captioning	Code Available	1
A large annotated corpus for learning natural language inference	Aug 21, 2015	Image Captioning, Natural Language Inference	Code Available	1
VQA: Visual Question Answering	May 3, 2015	Image Captioning, Multiple-choice	Code Available	1
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention	Feb 10, 2015	Caption Generation, Image Captioning	Code Available	1
CIDEr: Consensus-based Image Description Evaluation	Nov 20, 2014	Action Recognition, Attribute	Code Available	1
Show and Tell: A Neural Image Caption Generator	Nov 17, 2014	Image Captioning, Image Retrieval with Multi-Modal Query	Code Available	1
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos	Jul 16, 2025	Image Captioning, Representation Learning	Unverified	0
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval	Jun 28, 2025	Cross-Modal Retrieval, Image Captioning	Unverified	0
HalLoc: Token-level Localization of Hallucinations for Vision Language Models	Jun 12, 2025	Hallucination, Image Captioning	Code Available	0
A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning	Jun 11, 2025	Decoder, Image Captioning	Unverified	0
Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring	Jun 10, 2025	Image Captioning	Unverified	0
Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings	Jun 10, 2025	Image Captioning	Code Available	0
Edit Flows: Flow Matching with Edit Operations	Jun 10, 2025	Code Generation, Image Captioning	Unverified	0
An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models	Jun 10, 2025	Action Generation, Image Captioning	Unverified	0
GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition	Jun 9, 2025	Image Captioning	Code Available	0
Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning	Jun 8, 2025	Attribute, Hallucination	Unverified	0
Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation	Jun 7, 2025	Camouflaged Object Segmentation, Feature Correlation	Code Available	0
SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs	Jun 5, 2025	backdoor defense, Image Captioning	Unverified	0
Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation	Jun 3, 2025	Caption Generation, Image Captioning	Unverified	0
Light as Deception: GPT-driven Natural Relighting Against Vision-Language Pre-training Models	May 30, 2025	Image Captioning, Question Answering	Unverified	0
Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model	May 29, 2025	Image Captioning, Language Modeling	Unverified	0
CLDTracker: A Comprehensive Language Description for Visual Tracking	May 29, 2025	Image Captioning, Visual Tracking	Code Available	0
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport	May 29, 2025	Document Level Machine Translation, Image Captioning	Code Available	0
Correlating instruction-tuning (in multimodal models) with vision-language processing (in the brain)	May 26, 2025	Image Captioning	Code Available	0
TNG-CLIP: Training-Time Negation Data Generation for Negation Awareness of CLIP	May 24, 2025	Image Captioning, Image Generation	Unverified	0
Redemption Score: An Evaluation Framework to Rank Image Captions While Redeeming Image Semantics and Language Pragmatics	May 22, 2025	Image Captioning, text similarity	Unverified	0
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation	May 22, 2025	Hallucination, Image Captioning	Unverified	0
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval	May 21, 2025	counterfactual, Graph Generation	Code Available	0
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI	May 20, 2025	Anomaly Localization, Benchmarking	Unverified	0
MedBLIP: Fine-tuning BLIP for Medical Image Captioning	May 20, 2025	Decoder, Image Captioning	Unverified	0
Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models	May 20, 2025	Hallucination, Image Captioning	Unverified	0
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding	May 20, 2025	Image Captioning, Question Answering	Code Available	0
Sat2Sound: A Unified Framework for Zero-Shot Soundscape Mapping	May 19, 2025	Contrastive Learning, Cross-Modal Retrieval	Unverified	0
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models	May 16, 2025	Image Captioning, Question Answering	Code Available	0
Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models	May 15, 2025	Image Captioning, Language Modeling	Unverified	0
Describe Anything in Medical Images	May 9, 2025	Attribute, Diagnostic	Unverified	0
ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding	May 9, 2025	Image Captioning, Object Recognition	Unverified	0
A Grounded Memory System For Smart Personal Assistants	May 9, 2025	Entity Disambiguation, Image Captioning	Unverified	0
Mitigating Image Captioning Hallucinations in Vision-Language Models	May 6, 2025	Hallucination, Hallucination Evaluation	Unverified	0
Compositional Image-Text Matching and Retrieval by Grounding Entities	May 4, 2025	Image Captioning, Image-text matching	Code Available	0
Transferable Adversarial Attacks on Black-Box Vision-Language Models	May 2, 2025	Image Captioning, Object Recognition	Unverified	0
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM	Apr 30, 2025	Image Captioning, Object Recognition	Unverified	0
MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation	Apr 29, 2025	cross-modal alignment, Decoder	Code Available	0
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning	Apr 21, 2025	Image Captioning	Unverified	0
Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding	Apr 20, 2025	Autonomous Driving, Image Captioning	Code Available	0
Generalized Visual Relation Detection with Diffusion Models	Apr 16, 2025	Graph Generation, Human-Object Interaction Detection	Unverified	0
LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation	Apr 15, 2025	Image Captioning, Question Answering	Unverified	0
TADACap: Time-series Adaptive Domain-Aware Captioning	Apr 15, 2025	Image Captioning, Retrieval	Unverified	0
Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization	Apr 14, 2025	Benchmarking, Earth Observation	Unverified	0


Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified