Image Captioning

Image captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
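
To make the BLEU metric mentioned above concrete, here is a minimal sketch of unsmoothed sentence-level BLEU in plain Python. It assumes captions are already tokenized into lists of words; the function names (`ngrams`, `modified_precision`, `sentence_bleu`) are illustrative and not taken from any benchmark toolkit. Leaderboard scores are typically computed at corpus level with an official evaluation script and reported on a 0-100 scale, so this sketch will not reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: compare against the reference length closest to
    # the candidate length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    reference = "a dog runs on the grass".split()
    candidate = "a dog runs on the".split()
    print(sentence_bleu(candidate, [reference]))
```

In the usage example the candidate matches the reference on every n-gram but is one word shorter, so the score is determined entirely by the brevity penalty. CIDEr follows a different recipe (TF-IDF-weighted n-gram similarity against the reference set) and is not covered by this sketch.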

Papers


Showing 201–225 of 1878 papers

Title	Date	Tasks	Status	Hype
Cross-Modal Consistency in Multimodal Large Language Models	Nov 14, 2024	Image Captioning, object-detection	Unverified	0
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions	Nov 13, 2024	Descriptive, Hallucination	Code Available	0
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions	Nov 12, 2024	Descriptive, Image Captioning	Unverified	0
Grounded Video Caption Generation	Nov 12, 2024	Caption Generation, Image Captioning	Unverified	0
ViTOC: Vision Transformer and Object-aware Captioner	Nov 9, 2024	Diversity, Image Captioning	Unverified	0
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models	Nov 8, 2024	Image Captioning, Image Generation	Unverified	0
Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model	Nov 7, 2024	Image Captioning, Image Generation	Code Available	0
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation	Nov 7, 2024	Contrastive Learning, Image Captioning	Code Available	4
Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models	Nov 7, 2024	Adversarial Attack, Image Captioning	Unverified	0
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering	Nov 3, 2024	Descriptive, Image Captioning	Unverified	0
Designing a Robust Radiology Report Generation System	Nov 2, 2024	Decision Making, Diagnostic	Unverified	0
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP	Oct 31, 2024	Image Captioning, Prompt Learning	Unverified	0
Nearest Neighbor Normalization Improves Multimodal Retrieval	Oct 31, 2024	Cross-Modal Retrieval, Image Captioning	Code Available	1
Large Language Model Benchmarks in Medical Tasks	Oct 28, 2024	Image Captioning, Language Modeling	Unverified	0
Image Generation from Image Captioning -- Invertible Approach	Oct 26, 2024	Image Captioning, Image Generation	Unverified	0
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts	Oct 25, 2024	Denoising, Image Captioning	Unverified	0
Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing	Oct 23, 2024	Adversarial Attack, Backdoor Attack	Unverified	0
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning	Oct 23, 2024	Image Captioning, Instruction Following	Code Available	1
Altogether: Image Captioning via Re-aligning Alt-text	Oct 22, 2024	Image Captioning, image-classification	Unverified	0
Frontiers in Intelligent Colonoscopy	Oct 22, 2024	Image Captioning	Code Available	2
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use	Oct 21, 2024	Image Captioning, Task Planning	Unverified	0
TIPS: Text-Image Pretraining with Spatial Awareness	Oct 21, 2024	Depth Estimation, Image Captioning	Code Available	2
An Efficient System for Automatic Map Storytelling -- A Case Study on Historical Maps	Oct 21, 2024	Image Captioning	Code Available	0
MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images	Oct 21, 2024	Few-Shot Learning, Image Captioning	Code Available	0
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models	Oct 17, 2024	Image Captioning, Question Answering	Code Available	2


Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified