Image Captioning

Image Captioning is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, in which an input image is encoded into an intermediate representation and then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
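
To make the BLEU metric mentioned above more concrete, here is a minimal sketch of sentence-level BLEU in Python: clipped n-gram precision combined by a geometric mean and multiplied by a brevity penalty. The function names (`ngrams`, `modified_precision`, `bleu`) are illustrative, not taken from any benchmark toolkit, and the sketch assumes pre-tokenized captions and applies no smoothing. Leaderboards usually report corpus-level scores scaled to 0-100, so numbers from this sketch are not directly comparable to the tables on this page.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference caption."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU without smoothing (0.0 if any n-gram precision is 0)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: compare against the reference length closest to the
    # candidate length (the shorter reference wins ties).
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    # Hypothetical captions, for illustration only.
    refs = ["a dog runs across the green grass".split(),
            "a dog is running on the grass".split()]
    hyp = "a dog runs on the grass".split()
    print(f"BLEU-4: {bleu(hyp, refs):.3f}")
```

CIDEr follows a similar n-gram matching idea but weights n-grams by TF-IDF across the reference set, which is why it needs corpus statistics and is not shown here.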

Papers

Showing 751–800 of 1878 papers

Title	Date	Tasks	Status	Hype
Towards Models that Can See and Read	Jan 18, 2023	Decoder, Image Captioning	Unverified	0
Embodied Agents for Efficient Exploration and Smart Scene Description	Jan 17, 2023	Efficient Exploration, Image Captioning	Unverified	0
See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning	Jan 12, 2023	Few-Shot Learning, Image Captioning	Code Available	1
An Image captioning algorithm based on the Hybrid Deep Learning Technique (CNN+GRU)	Jan 6, 2023	Decoder, Image Captioning	Unverified	0
Adaptively Clustering Neighbor Elements for Image-Text Generation	Jan 5, 2023	Clustering, Decoder	Code Available	0
An Empirical Investigation into the Use of Image Captioning for Automated Software Documentation	Jan 3, 2023	Image Captioning, Machine Translation	Unverified	0
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3	Jan 1, 2023	Image Captioning, Question Answering	Unverified	0
Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks	Jan 1, 2023	Cross-Modal Retrieval, Image Captioning	Unverified	0
Crossing the Gap: Domain Generalization for Image Captioning	Jan 1, 2023	Domain Generalization, Image Captioning	Unverified	0
On the Interpretability of Attention Networks	Dec 30, 2022	Image Captioning	Code Available	0
Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning	Dec 27, 2022	Image Captioning, Image Retrieval	Code Available	1
On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective	Dec 24, 2022	Decision Making, Image Captioning	Code Available	1
Do DALL-E and Flamingo Understand Each Other?	Dec 23, 2022	Image Captioning, Image Generation	Unverified	0
Transferring General Multimodal Pretrained Models to Text Recognition	Dec 19, 2022	Image Captioning, Optical Character Recognition (OCR)	Unverified	0
Position-guided Text Prompt for Vision-Language Pre-training	Dec 19, 2022	Cross-Modal Retrieval, Image Captioning	Code Available	1
Efficient Image Captioning for Edge Devices	Dec 18, 2022	CPU, Image Captioning	Unverified	0
Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift	Dec 15, 2022	Benchmarking, Image Captioning	Code Available	1
Cross-Modal Similarity-Based Curriculum Learning for Image Captioning	Dec 14, 2022	Image Captioning, Language Modeling	Unverified	0
NLIP: Noise-robust Language-Image Pre-training	Dec 14, 2022	Image Captioning, Image-text Retrieval	Unverified	0
Cap2Aug: Caption guided Image to Image data Augmentation	Dec 11, 2022	Classification, Cross-Domain Few-Shot	Unverified	0
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory	Dec 10, 2022	Image Captioning, Language Modeling	Code Available	0
ParsVQA-Caps: A Benchmark for Visual Question Answering and Image Captioning in Persian	Dec 7, 2022	Image Captioning, Question Answering	Unverified	0
Dataset vs Reality: Understanding Model Performance from the Perspective of Information Need	Dec 6, 2022	Image Captioning, Information Retrieval	Unverified	0
Semantic-Conditional Diffusion Networks for Image Captioning	Dec 6, 2022	Cross-Modal Retrieval, Decoder	Code Available	2
Adaptive Testing of Computer Vision Models	Dec 6, 2022	Image Captioning, object-detection	Code Available	0
Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning	Dec 6, 2022	Image Captioning, reinforcement-learning	Code Available	0
Controllable Image Captioning via Prompting	Dec 4, 2022	controllable image captioning, Image Captioning	Unverified	0
Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset	Dec 1, 2022	Image Captioning, Image Generation	Unverified	0
Focus! Relevant and Sufficient Context Selection for News Image Captioning	Dec 1, 2022	Image Captioning, Relation Extraction	Unverified	0
Uncertainty-Aware Image Captioning	Nov 30, 2022	Caption Generation, Image Captioning	Unverified	0
CLID: Controlled-Length Image Descriptions with Limited Data	Nov 27, 2022	controllable image captioning, Image Captioning	Code Available	0
Predictive linguistic cues for fake news: a societal artificial intelligence problem	Nov 26, 2022	Attribute, Image Captioning	Unverified	0
Aesthetically Relevant Image Captioning	Nov 25, 2022	Image Captioning, Sentence	Code Available	1
Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap	Nov 23, 2022	Image Captioning, object-detection	Unverified	0
Retrieval-Augmented Multimodal Language Modeling	Nov 22, 2022	Caption Generation, Image Captioning	Unverified	0
X^2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks	Nov 22, 2022	All, Cross-Modal Retrieval	Code Available	2
Exploring Discrete Diffusion Models for Image Captioning	Nov 21, 2022	Image Captioning, Image Generation	Code Available	1
A survey on knowledge-enhanced multimodal learning	Nov 19, 2022	Conditional Image Generation, Factual Visual Question Answering	Unverified	0
An Enhanced Object Detection Model for Scene Graph Generation	Nov 18, 2022	Graph Generation, Image Captioning	Unverified	0
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision	Nov 17, 2022	Image Captioning, Question Answering	Code Available	1
Feedback is Needed for Retakes: An Explainable Poor Image Notification Framework for the Visually Impaired	Nov 17, 2022	Image Captioning	Unverified	0
Progressive Tree-Structured Prototype Network for End-to-End Image Captioning	Nov 17, 2022	Image Captioning	Code Available	1
PromptCap: Prompt-Guided Task-Aware Image Captioning	Nov 15, 2022	Image Captioning, Language Modelling	Code Available	1
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model	Nov 15, 2022	All, Disentanglement	Code Available	6
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment	Nov 14, 2022	Computational Efficiency, Image Captioning	Unverified	0
Large-Scale Bidirectional Training for Zero-Shot Image Captioning	Nov 13, 2022	Image Captioning, Keyword Extraction	Code Available	1
DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis	Nov 12, 2022	COVID-19 Diagnosis, Decoder	Code Available	1
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics	Nov 12, 2022	Audio captioning, Image Captioning	Unverified	0
VieCap4H-VLSP 2021: ObjectAoA-Enhancing performance of Object Relation Transformer with Attention on Attention for Vietnamese image captioning	Nov 10, 2022	Image Captioning, Vietnamese Image Captioning	Unverified	0
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions	Nov 9, 2022	Image Captioning, Language Modeling	Unverified	0

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified