Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
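
The description above names BLEU as a common evaluation metric. As a rough illustration, the sketch below computes an unsmoothed sentence-level BLEU-4 score in plain Python. The function names `ngrams` and `sentence_bleu`, the whitespace tokenization, and the example captions are assumptions made for this sketch; leaderboard numbers are normally produced with standard evaluation toolkits, which may tokenize differently and aggregate over a whole corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights (hypothetical helper)."""
    cand = candidate.split()  # assumption: simple whitespace tokenization
    refs = [r.split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero n-gram precision gives BLEU = 0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


score = sentence_bleu("a dog runs on grass", ["a dog runs on the grass"])
print(round(score, 4))
```

The result lies in [0, 1]; the example call gives roughly 0.58. Leaderboards usually report BLEU multiplied by 100.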

Papers

Showing 1401–1450 of 1878 papers

Title	Date	Tasks	Status
Describing image focused in cognitive and visual details for visually impaired people: An approach to generating inclusive paragraphs	Feb 10, 2022	Dense Captioning, Image Captioning	Unverified
Describing Semantic Representations of Brain Activity Evoked by Visual Stimuli	Jan 19, 2018	Image Captioning, Sentence	Unverified
Designing a Robust Radiology Report Generation System	Nov 2, 2024	Decision Making, Diagnostic	Unverified
DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps	Feb 3, 2023	Image Captioning, Optical Character Recognition (OCR)	Unverified
Diagnostic Captioning: A Survey	Jan 18, 2021	Diagnostic, Image Captioning	Unverified
Dialog Generation Using Multi-Turn Reasoning Neural Networks	Jun 1, 2018	Constituency Parsing, Image Captioning	Unverified
Dietary Assessment with Multimodal ChatGPT: A Systematic Analysis	Dec 14, 2023	Image Captioning, Scene Understanding	Unverified
Cap2Aug: Caption guided Image to Image data Augmentation	Dec 11, 2022	Classification, Cross-Domain Few-Shot	Unverified
DiffCap: Exploring Continuous Diffusion on Image Captioning	May 20, 2023	Caption Generation, Diversity	Unverified
Differentiable Expected BLEU for Text Generation	Sep 27, 2018	Image Captioning, Machine Translation	Unverified
DIFNet: Boosting Visual Information Flow for Image Captioning	Jan 1, 2022	Image Captioning, Prediction	Unverified
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention	Oct 28, 2022	Image Captioning, Language Modeling	Unverified
DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding	Dec 2, 2024	Caption Generation, Domain Generalization	Unverified
Disambiguated skip-gram model	Oct 1, 2018	Image Captioning, model	Unverified
Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures	Jan 3, 2020	Change Detection, Descriptive	Unverified
Discovering Non-Monotonic Autoregressive Ordering for Text Generation Models using Sinkhorn Distributions	Jan 17, 2022	Code Generation, Decoder	Unverified
Discovery and usage of joint attention in images	Apr 10, 2018	Image Captioning	Unverified
Positioning yourself in the maze of Neural Text Generation: A Task-Agnostic Survey	Oct 14, 2020	Image Captioning, Machine Translation	Unverified
Distinctive Image Captioning via CLIP Guided Group Optimization	Aug 8, 2022	Image Captioning	Unverified
Distinctive-attribute Extraction for Image Captioning	Jul 25, 2018	Attribute, Attribute Extraction	Unverified
Distributed Attention for Grounded Image Captioning	Aug 2, 2021	Image Captioning, Sentence	Unverified
Diverse and Coherent Paragraph Generation from Images	Sep 3, 2018	Diversity, Image Captioning	Unverified
Diversity as a By-Product: Goal-oriented Language Generation Leads to Linguistic Variation	Jul 1, 2021	Diversity, Image Captioning	Unverified
DLIP: Distilling Language-Image Pre-training	Aug 24, 2023	Image Captioning, Image-text Retrieval	Unverified
Do DALL-E and Flamingo Understand Each Other?	Dec 23, 2022	Image Captioning, Image Generation	Unverified
Does Multimodality Help Human and Machine for Translation and Image Captioning?	May 30, 2016	Image Captioning, Image Description	Unverified
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?	Jun 20, 2024	Caption Generation, Hallucination	Unverified
Domain-Independent Captioning of Domain-Specific Images	Jun 1, 2013	Image Captioning, Image Retrieval	Unverified
Domain-Specific Image Captioning	Jun 1, 2014	Image Captioning, Sentence Compression	Unverified
Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?	Jun 18, 2024	Attribute, Hallucination	Unverified
Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness	Jan 16, 2025	Adversarial Defense, Adversarial Robustness	Unverified
Doubly Attentive Transformer Machine Translation	Jul 30, 2018	Decoder, Image Captioning	Unverified
Downstream-Pretext Domain Knowledge Traceback for Active Learning	Jul 20, 2024	Active Learning, Diversity	Unverified
DRAMA: Joint Risk Localization and Captioning in Driving	Sep 22, 2022	Image Captioning	Unverified
Dropout during inference as a model for neurological degeneration in an image captioning network	Aug 11, 2018	Image Captioning	Unverified
DS@BioMed at ImageCLEFmedical Caption 2024: Enhanced Attention Mechanisms in Medical Caption Generation through Concept Detection Integration	Jun 1, 2024	Caption Generation, Image Captioning	Unverified
Dual Attention on Pyramid Feature Maps for Image Captioning	Nov 2, 2020	Descriptive, Image Captioning	Unverified
Dual-CNN: A Convolutional language decoder for paragraph image captioning	Feb 14, 2020	Decoder, Diversity	Unverified
DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training	Mar 17, 2022	Denoising, Image Captioning	Unverified
Dynamic Feature Selection with Attention in Incremental Parsing	Aug 1, 2018	Dependency Parsing, Dialogue Generation	Unverified
E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning	Jun 3, 2021	Decoder, Image Captioning	Unverified
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents	Feb 6, 2025	Image Captioning, Optical Character Recognition	Unverified
ECOL-R: Encouraging Copying in Novel Object Captioning with Reinforcement Learning	Jan 25, 2021	Image Captioning, Object	Unverified
Edit Flows: Flow Matching with Edit Operations	Jun 10, 2025	Code Generation, Image Captioning	Unverified
Edit me: A Corpus and a Framework for Understanding Natural Language Image Editing	May 1, 2018	Image Captioning, Question Answering	Unverified
Effect of Data Annotation, Feature Selection and Model Choice on Spatial Description Generation in French	Sep 1, 2016	feature selection, Image Captioning	Unverified
Efficient Few-Shot Continual Learning in Vision-Language Models	Feb 6, 2025	Continual Learning, Image Captioning	Unverified
Efficient Image Captioning for Edge Devices	Dec 18, 2022	CPU, Image Captioning	Unverified
Efficient Multi-modal Large Language Models via Visual Token Grouping	Nov 26, 2024	Image Captioning, Question Answering	Unverified
Egocentric Image Captioning for Privacy-Preserved Passive Dietary Intake Monitoring	Jul 1, 2021	Food Recognition, Image Captioning	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified