Image Captioning

Image captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with metrics such as BLEU or CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
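To make the BLEU metric mentioned above more concrete, here is a minimal sketch of a sentence-level BLEU-style score in plain Python. It combines clipped n-gram precision with a brevity penalty. The function names (`ngrams`, `clipped_precision`, `simple_bleu`) are illustrative assumptions, no smoothing is applied, and this is not the evaluation script behind any leaderboard number on this page.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the maximum count of that n-gram in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def simple_bleu(candidate, references, max_n=4):
    """Illustrative sentence-level BLEU: geometric mean of clipped
    precisions for n = 1..max_n, times a brevity penalty. Without
    smoothing, any zero precision gives a score of 0."""
    precisions = [clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate
    # length (the shorter reference wins ties).
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


if __name__ == "__main__":
    caption = "a dog runs on the grass".split()
    refs = ["a dog runs on the grass".split(),
            "a dog is running across a lawn".split()]
    print(simple_bleu(caption, refs))
```

Captions here are pre-tokenized by whitespace, which is a simplification. Published BLEU and CIDEr scores generally depend on a specific tokenizer and are computed over a whole corpus, so values from this sketch are not comparable to the benchmark tables.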

Papers


Showing 701–750 of 1878 papers

Title	Date	Tasks	Status	Hype
Cross-Domain Image Captioning with Discriminative Finetuning	Apr 4, 2023	Descriptive, Image Captioning	Unverified	0
Grand Challenge On Detecting Cheapfakes	Apr 3, 2023	Image Captioning	Code Available	0
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations	Mar 29, 2023	Image Captioning, Instance Segmentation	Unverified	0
AutoAD: Movie Description in Context	Mar 29, 2023	Image Captioning, Text Generation	Code Available	1
Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation	Mar 29, 2023	Image Captioning, Image-text matching	Code Available	1
Variational Distribution Learning for Unsupervised Text-to-Image Generation	Mar 28, 2023	Image Captioning, Image Generation	Unverified	0
Open-Vocabulary Object Detection using Pseudo Caption Labels	Mar 23, 2023	Image Captioning, Knowledge Distillation	Unverified	0
MAGVLT: Masked Generative Vision-and-Language Transformer	Mar 21, 2023	Image Captioning, Image Generation	Code Available	1
Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation	Mar 21, 2023	Contrastive Learning, Image Captioning	Code Available	1
Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings	Mar 20, 2023	Image Captioning, Retrieval	Unverified	0
Multi-modal reward for visual relationships-based image captioning	Mar 19, 2023	Caption Generation, Deep Reinforcement Learning	Unverified	0
Visual Information Matters for ASR Error Correction	Mar 16, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified	0
PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning	Mar 15, 2023	Image Captioning	Unverified	0
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	Mar 13, 2023	Common Sense Reasoning, Explanation Generation	Unverified	0
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions	Mar 12, 2023	Image Captioning, Question Answering	Code Available	2
ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation	Mar 11, 2023	Image Captioning, Image to text	Code Available	1
Learning Combinatorial Prompts for Universal Controllable Image Captioning	Mar 11, 2023	controllable image captioning, Image Captioning	Unverified	0
Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection	Mar 10, 2023	Anomaly Detection, Image Captioning	Code Available	0
Spawrious: A Benchmark for Fine Control of Spurious Correlation Biases	Mar 9, 2023	Image Captioning, image-classification	Code Available	1
Interpretable Visual Question Answering Referring to Outside Knowledge	Mar 8, 2023	Diversity, Image Captioning	Unverified	0
Graph Neural Networks in Vision-Language Image Understanding: A Survey	Mar 7, 2023	Image Captioning, Image Retrieval	Unverified	0
DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training	Mar 6, 2023	Decoder, Image Captioning	Code Available	1
Neighborhood Contrastive Transformer for Change Captioning	Mar 6, 2023	Decoder, Image Captioning	Code Available	1
Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning	Mar 5, 2023	Image Captioning	Unverified	0
Prismer: A Vision-Language Model with Multi-Task Experts	Mar 4, 2023	Few-Shot Learning, Image Captioning	Code Available	1
ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing	Mar 4, 2023	Diversity, Image Captioning	Code Available	1
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks	Mar 4, 2023	Cross-Modal Retrieval, Image Captioning	Code Available	1
ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax	Mar 2, 2023	Descriptive, Image Captioning	Code Available	1
Language Is Not All You Need: Aligning Perception with Language Models	Feb 27, 2023	All, Image Captioning	Unverified	0
Retrieval-augmented Image Captioning	Feb 16, 2023	Decoder, Image Captioning	Code Available	1
Tuning computer vision models with task rewards	Feb 16, 2023	Colorization, Image Captioning	Unverified	0
Towards Local Visual Modeling for Image Captioning	Feb 13, 2023	Image Captioning, Object Recognition	Code Available	1
See Your Heart: Psychological states Interpretation through Visual Creations	Feb 11, 2023	Emotion Classification, Image Captioning	Unverified	0
Nemesis: Neural Mean Teacher Learning-Based Emotion-Centric Speaker	Feb 9, 2023	Image Captioning	Unverified	0
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning	Feb 9, 2023	Few-Shot Learning, Image Captioning	Unverified	0
Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning	Feb 8, 2023	Caption Generation, Decoder	Unverified	0
KENGIC: KEyword-driven and N-Gram Graph based Image Captioning	Feb 7, 2023	Image Captioning	Unverified	0
Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning	Feb 4, 2023	Caption Generation, Coherence Evaluation	Code Available	0
DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps	Feb 3, 2023	Image Captioning, Optical Character Recognition (OCR)	Unverified	0
IC3: Image Captioning by Committee Consensus	Feb 2, 2023	Image Captioning	Code Available	1
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers	Jan 31, 2023	Image Captioning, Image Classification	Code Available	1
PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks	Jan 30, 2023	Crowd Counting, Data Augmentation	Unverified	0
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	Jan 30, 2023	Generative Visual Question Answering, Image Captioning	Code Available	4
Exploring External Knowledge for Accurate modeling of Visual and Language Problems	Jan 27, 2023	Image Captioning, Machine Translation	Unverified	0
Paraphrase Acquisition from Image Captions	Jan 26, 2023	Articles, Image Captioning	Code Available	0
Style-Aware Contrastive Learning for Multi-Style Image Captioning	Jan 26, 2023	Contrastive Learning, Image Captioning	Unverified	0
Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data	Jan 26, 2023	Image Captioning, Relational Captioning	Unverified	0
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation	Jan 22, 2023	Common Sense Reasoning, Image Captioning	Unverified	0
Exploring the Synergy Between Vision-Language Pretraining and ChatGPT for Artwork Captioning: A Preliminary Study	Jan 21, 2023	Image Captioning, Informativeness	Code Available	0
Visual Semantic Relatedness Dataset for Image Captioning	Jan 20, 2023	Image Captioning, text similarity	Code Available	0


Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified