Image Captioning

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which the input image is encoded into an intermediate representation of its content and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU and CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
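To make the encoder-decoder description concrete, here is a deliberately tiny, hand-built sketch in Python. The "encoder" mean-pools hypothetical patch feature vectors into one image vector, and the "decoder" generates words greedily, scoring each vocabulary token by a dot product with the image vector plus a bonus for following the previous token. The vocabulary, OUT_EMB, TRANSITIONS and all function names are invented for illustration; real systems use a CNN or vision transformer encoder and a learned RNN or transformer decoder, and often beam search rather than greedy decoding.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Hypothetical toy parameters; a real captioner learns these from data.
# OUT_EMB: how strongly each token responds to each image-feature dimension.
OUT_EMB: Dict[str, Vector] = {
    "<eos>": [0.0, 0.0],
    "a": [0.0, 0.0],
    "dog": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "on": [0.0, 0.0],
    "grass": [0.5, 0.5],
}
# TRANSITIONS: bonus for emitting a token directly after the previous one.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "<bos>": {"a": 2.0},
    "a": {"dog": 1.0, "cat": 1.0},
    "dog": {"on": 2.0},
    "cat": {"on": 2.0},
    "on": {"grass": 2.0},
    "grass": {"<eos>": 2.0},
}


def encode(patches: Sequence[Sequence[float]]) -> Vector:
    """Encoder: mean-pool per-patch feature vectors into one image representation."""
    dim = len(patches[0])
    return [sum(p[d] for p in patches) / len(patches) for d in range(dim)]


def greedy_decode(image_vec: Vector, max_len: int = 20) -> List[str]:
    """Decoder: emit the highest-scoring token at each step until <eos> or max_len.

    score(tok) = image_vec . OUT_EMB[tok] + TRANSITIONS[prev][tok]
    """
    caption: List[str] = []
    prev = "<bos>"
    for _ in range(max_len):
        scores = {
            tok: sum(x * w for x, w in zip(image_vec, emb))
            + TRANSITIONS.get(prev, {}).get(tok, 0.0)
            for tok, emb in OUT_EMB.items()
        }
        best = max(scores, key=scores.__getitem__)
        if best == "<eos>":
            break
        caption.append(best)
        prev = best
    return caption


def caption_image(patches: Sequence[Sequence[float]]) -> str:
    """Full pipeline: encode the image once, then decode a caption from it."""
    return " ".join(greedy_decode(encode(patches)))


print(caption_image([[1.0, 0.0], [0.8, 0.2]]))  # dog-like features -> a dog on grass
print(caption_image([[0.0, 1.0]]))              # cat-like features -> a cat on grass
```

Changing only the image features changes the generated noun ("dog" vs "cat"), which is the essential property of the framework: the decoder is conditioned on the encoder's representation at every step.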

Papers


Showing 226–250 of 1878 papers

Title	Date	Tasks	Status	Hype
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning	Dec 27, 2022	Image Captioning, Image Retrieval	Code Available	1
On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective	Dec 24, 2022	Decision Making, Image Captioning	Code Available	1
Position-guided Text Prompt for Vision-Language Pre-training	Dec 19, 2022	Cross-Modal Retrieval, Image Captioning	Code Available	1
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift	Dec 15, 2022	Benchmarking, Image Captioning	Code Available	1
Aesthetically Relevant Image Captioning	Nov 25, 2022	Image Captioning, Sentence	Code Available	1
Exploring Discrete Diffusion Models for Image Captioning	Nov 21, 2022	Image Captioning, Image Generation	Code Available	1
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision	Nov 17, 2022	Image Captioning, Question Answering	Code Available	1
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning	Nov 17, 2022	Image Captioning	Code Available	1
PromptCap: Prompt-Guided Task-Aware Image Captioning	Nov 15, 2022	Image Captioning, Language Modelling	Code Available	1
Large-Scale Bidirectional Training for Zero-Shot Image Captioning	Nov 13, 2022	Image Captioning, Keyword Extraction	Code Available	1
DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis	Nov 12, 2022	COVID-19 Diagnosis, Decoder	Code Available	1
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation	Oct 20, 2022	Decoder, Image Captioning	Code Available	1
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting	Oct 13, 2022	Image Captioning, Question Answering	Code Available	1
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis	Oct 10, 2022	All, Image Captioning	Code Available	1
CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning	Oct 10, 2022	Decoder, Denoising	Code Available	1
Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement	Oct 7, 2022	Image Captioning, Sarcasm Detection	Code Available	1
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation	Sep 30, 2022	Decoder, Image Captioning	Code Available	1
Linearly Mapping from Image to Text Space	Sep 30, 2022	Image Captioning, Image to text	Code Available	1
Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text	Sep 28, 2022	Image Captioning, Image Retrieval	Code Available	1
Learning Distinct and Representative Styles for Image Captioning	Sep 17, 2022	Diversity, Image Captioning	Code Available	1
Belief Revision based Caption Re-ranker with Visual Semantic Information	Sep 16, 2022	Caption Generation, Image Captioning	Code Available	1
M^4I: Multi-modal Models Membership Inference	Sep 15, 2022	Image Captioning, Inference Attack	Code Available	1
VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media	Aug 18, 2022	Descriptive, Diversity	Code Available	1
Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning	Aug 13, 2022	Image Captioning	Code Available	1
Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning	Aug 8, 2022	Image Captioning, Image Generation	Code Available	1

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified
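Several rows in the table above are ranked by BLEU-4. As a rough illustration of what that number measures, the following Python sketch computes sentence-level BLEU-4 for one tokenized candidate caption against several reference captions: clipped ("modified") n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplification under stated assumptions, not the official scorer: it applies no smoothing, uses the reference length closest to the candidate's, and scores a single sentence, whereas leaderboard figures are typically corpus-level scores from the benchmark's evaluation toolkit, reported on a 0-100 scale. The function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: Sequence[str], references: List[Sequence[str]], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU (BLEU-4 by default), in [0, 1]."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        total = sum(cand.values())
        # Highest count of each n-gram in any single reference (Counter | is an element-wise max).
        max_ref: Counter = Counter()
        for ref in references:
            max_ref |= ngram_counts(ref, n)
        # Clip each candidate n-gram count by that maximum ("modified" precision).
        clipped = sum(min(count, max_ref[g]) for g, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the geometric mean
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty against the reference length closest to the candidate's (ties -> shorter).
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


refs = [["a", "dog", "on", "the", "grass", "today"]]
# Every candidate n-gram matches, so only the brevity penalty applies: exp(1 - 6/5), about 0.819.
print(sentence_bleu(["a", "dog", "on", "the", "grass"], refs))
```

Clipping is what stops a degenerate caption such as "the the the the" from earning full unigram precision, and the brevity penalty is what stops very short captions from scoring well on precision alone.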

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified
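Most of the leaderboards above rank models by CIDEr. The Python sketch below implements the base CIDEr formula: for every n from 1 to 4, each caption becomes a TF-IDF vector over its n-grams, where the IDF reflects how many images' reference sets contain the n-gram; a candidate's score for each n is its mean cosine similarity to that image's references, and the final score averages over n. This is an illustration under stated assumptions only. The commonly used evaluation code implements the CIDEr-D variant, which adds a length-based penalty, clips candidate counts and applies a scale factor, and published figures such as those above are usually that score multiplied by 100, so values from this sketch are not comparable with the tables. Function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(u: Dict[NGram, float], v: Dict[NGram, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either vector has zero norm."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def cider(candidates: List[Sequence[str]], references: List[List[Sequence[str]]], max_n: int = 4) -> List[float]:
    """Base CIDEr per candidate; references[i] holds image i's captions (at least one)."""
    num_images = len(candidates)
    # Document frequency: the number of images whose reference set contains the n-gram.
    doc_freq: Counter = Counter()
    for refs in references:
        seen = set()
        for ref in refs:
            for n in range(1, max_n + 1):
                seen.update(ngram_counts(ref, n))
        doc_freq.update(seen)

    def tfidf(tokens: Sequence[str], n: int) -> Dict[NGram, float]:
        # Raw counts serve as TF: normalising by caption length would cancel in the cosine.
        # max(1, df) avoids division by zero for n-grams that appear in no reference.
        return {g: c * math.log(num_images / max(1, doc_freq[g]))
                for g, c in ngram_counts(tokens, n).items()}

    scores = []
    for cand, refs in zip(candidates, references):
        per_n = []
        for n in range(1, max_n + 1):
            cand_vec = tfidf(cand, n)
            per_n.append(sum(cosine(cand_vec, tfidf(ref, n)) for ref in refs) / len(refs))
        scores.append(sum(per_n) / max_n)  # uniform weights over n = 1..max_n
    return scores


references = [
    [["a", "dog", "on", "the", "grass"]],
    [["two", "cats", "sleep", "on", "a", "sofa"]],
]
candidates = [["a", "dog", "on", "the", "sofa"], ["a", "red", "bus"]]
print(cider(candidates, references))  # about 0.646 (= 31/48) and 0.0
```

The IDF weighting is the key design choice: n-grams that occur in the references of every image ("a" and "on" here) get weight log(1) = 0, so matching them earns nothing, while image-specific n-grams such as "dog" dominate the score. With a single image, every reference n-gram gets zero weight, so the metric only becomes meaningful over a dataset.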