Image Captioning

Image captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
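To make the evaluation step more concrete, here is a minimal sketch of a BLEU-style score in plain Python. It is a simplified illustration, not the official COCO or nocaps evaluation code: it assumes whitespace tokenization, uniform n-gram weights, and no smoothing, and the function names (`ngrams`, `modified_precision`, `sentence_bleu`) and the example captions are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


# Hypothetical captions, for illustration only.
candidate = "a dog runs on the beach".split()
references = [
    "a dog runs along the beach".split(),
    "a dog is running on the sand".split(),
]
bleu2 = sentence_bleu(candidate, references, max_n=2)
bleu4 = sentence_bleu(candidate, references, max_n=4)
print(f"BLEU-2: {bleu2:.3f}  BLEU-4: {bleu4:.3f}")
```

In this example the candidate shares every unigram and four of its five bigrams with the references, so BLEU-2 is about 0.894, while BLEU-4 drops to zero because no 4-gram matches. Reported benchmark numbers are usually corpus-level and scaled to 0-100, so they are not directly comparable to a single-sentence score like this one. CIDEr follows a related idea but weights n-grams by TF-IDF across the reference set.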

Papers

Showing 801–850 of 1878 papers

Title	Date	Tasks	Status	Hype
Image Caption Generation for Low-Resource Assamese Language	Nov 1, 2022	Caption Generation, Decoder	Unverified	0
Text-Only Training for Image Captioning using Noise-Injected CLIP	Nov 1, 2022	Decoder, Image Captioning	Code Available	2
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention	Oct 28, 2022	Image Captioning, Language Modeling	Unverified	0
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning	Oct 26, 2022	Cross-Modal Retrieval, Decoder	Unverified	0
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks	Oct 26, 2022	Image Captioning, Language Modeling	Unverified	0
RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data	Oct 23, 2022	Image Captioning, Image-text Retrieval	Unverified	0
PoseScript: Linking 3D Human Poses and Natural Language	Oct 21, 2022	Cross-Modal Retrieval, Image Captioning	Code Available	2
Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation	Oct 20, 2022	Decoder, Image Captioning	Code Available	1
Image-Text Retrieval with Binary and Continuous Label Supervision	Oct 20, 2022	Image Captioning, Image-text Retrieval	Unverified	0
Prophet Attention: Predicting Attention with Future Attention for Image Captioning	Oct 19, 2022	Image Captioning	Unverified	0
Aligning MAGMA by Few-Shot Learning and Finetuning	Oct 18, 2022	Few-Shot Learning, Image Captioning	Unverified	0
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective	Oct 18, 2022	Image Captioning, Sentence	Unverified	0
Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training	Oct 17, 2022	Image Captioning, Network Interpretation	Code Available	0
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends	Oct 17, 2022	Few-Shot Learning, Image Captioning	Code Available	3
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting	Oct 13, 2022	Image Captioning, Question Answering	Code Available	1
Visual Language Maps for Robot Navigation	Oct 11, 2022	3D Reconstruction, Image Captioning	Code Available	2
MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer	Oct 10, 2022	Decoder, Image Captioning	Code Available	0
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis	Oct 10, 2022	All, Image Captioning	Code Available	1
Generating image captions with external encyclopedic knowledge	Oct 10, 2022	Caption Generation, Image Captioning	Unverified	0
CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning	Oct 10, 2022	Decoder, Denoising	Code Available	1
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP	Oct 9, 2022	Image Captioning, Open Vocabulary Semantic Segmentation	Code Available	2
Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement	Oct 7, 2022	Image Captioning, Sarcasm Detection	Code Available	1
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding	Oct 7, 2022	Chart Question Answering, Diversity	Code Available	2
Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning	Oct 4, 2022	Image Captioning, Sentence	Code Available	0
Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity	Oct 3, 2022	Audio captioning, Image Captioning	Unverified	0
On the Effects of Video Grounding on Language Models	Oct 1, 2022	Image Captioning, Question Answering	Unverified	0
DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis	Oct 1, 2022	COVID-19 Diagnosis, Decoder	Unverified	0
JPG - Jointly Learn to Align: Automated Disease Prediction and Radiology Report Generation	Oct 1, 2022	cross-modal alignment, Disease Prediction	Unverified	0
Multi-view and Cross-view Brain Decoding	Oct 1, 2022	Brain Decoding, Image Captioning	Unverified	0
SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation	Sep 30, 2022	Decoder, Image Captioning	Code Available	1
Linearly Mapping from Image to Text Space	Sep 30, 2022	Image Captioning, Image to text	Code Available	1
Medical Image Captioning via Generative Pretrained Transformers	Sep 28, 2022	Caption Generation, Descriptive	Unverified	0
Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text	Sep 28, 2022	Image Captioning, Image Retrieval	Code Available	1
DRAMA: Joint Risk Localization and Captioning in Driving	Sep 22, 2022	Image Captioning	Unverified	0
Show, Interpret and Tell: Entity-aware Contextualised Image Captioning in Wikipedia	Sep 21, 2022	Articles, Image Captioning	Unverified	0
Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering	Sep 21, 2022	Image Captioning, Optical Character Recognition (OCR)	Unverified	0
Learning Distinct and Representative Styles for Image Captioning	Sep 17, 2022	Diversity, Image Captioning	Code Available	1
Belief Revision based Caption Re-ranker with Visual Semantic Information	Sep 16, 2022	Caption Generation, Image Captioning	Code Available	1
LAVIS: A Library for Language-Vision Intelligence	Sep 15, 2022	Benchmarking, Image Captioning	Unverified	0
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks	Sep 15, 2022	Action Classification, Action Recognition	Unverified	0
M^4I: Multi-modal Models Membership Inference	Sep 15, 2022	Image Captioning, Inference Attack	Code Available	1
PaLI: A Jointly-Scaled Multilingual Language-Image Model	Sep 14, 2022	Decoder, Few-Shot Image Classification	Unverified	0
PreSTU: Pre-Training for Scene-Text Understanding	Sep 12, 2022	Decoder, Image Captioning	Unverified	0
Every picture tells a story: Image-grounded controllable stylistic story generation	Sep 4, 2022	Image Captioning, Image to text	Unverified	0
vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM	Sep 3, 2022	Decoder, Image Captioning	Unverified	0
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks	Aug 22, 2022	All, Cross-Modal Retrieval	Code Available	0
A Medical Semantic-Assisted Transformer for Radiographic Report Generation	Aug 22, 2022	Image Captioning, Medical Report Generation	Unverified	0
Target-oriented Sentiment Classification with Sequential Cross-modal Semantic Graph	Aug 19, 2022	Decoder, Image Captioning	Code Available	0
VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media	Aug 18, 2022	Descriptive, Diversity	Code Available	1
GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement	Aug 18, 2022	Grounded Situation Recognition, Image Captioning	Code Available	0


Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified