Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
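
As a rough illustration of how an n-gram overlap metric such as BLEU scores a caption, the sketch below computes an unsmoothed sentence-level BLEU from clipped n-gram precisions and a brevity penalty. It is a simplified sketch, not the scoring code behind the leaderboards on this page: the function names (`ngrams`, `modified_precision`, `sentence_bleu`) are hypothetical, tokenization is assumed to be plain whitespace splitting, and published results normally use corpus-level BLEU from an established evaluation toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU (illustrative only): geometric mean of
    the clipped 1..max_n-gram precisions, scaled by a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, a single zero precision zeroes the whole score.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # The brevity penalty uses the reference length closest to the candidate.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    reference = "a dog runs on the grass".split()
    candidate = "a dog runs on the".split()
    print(sentence_bleu(candidate, [reference]))
```

The candidate in the usage example matches the reference on every n-gram it contains, so its score is reduced only by the brevity penalty for being one token shorter. CIDEr follows a similar n-gram matching idea but weights n-grams by TF-IDF across the reference set, which this sketch does not attempt.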

Papers

Showing 901–950 of 1878 papers

Title	Date	Tasks	Status
Text-based Person Search without Parallel Image-Text Data	May 22, 2023	Image Captioning, Language Modeling	Unverified
A request for clarity over the End of Sequence token in the Self-Critical Sequence Training	May 20, 2023	Image Captioning, Sentence	Code Available
Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment	May 20, 2023	Image Captioning, Translation	Unverified
DiffCap: Exploring Continuous Diffusion on Image Captioning	May 20, 2023	Caption Generation, Diversity	Unverified
Semantic Composition in Visually Grounded Language Models	May 15, 2023	Image Captioning, Inductive Bias	Unverified
IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images	May 12, 2023	Hyperparameter Optimization, Image Captioning	Code Available
Simple Token-Level Confidence Improves Caption Correctness	May 11, 2023	Hallucination, Image Captioning	Unverified
Towards L-System Captioning for Tree Reconstruction	May 10, 2023	Image Captioning	Unverified
Exploiting Pseudo Image Captions for Multimodal Summarization	May 9, 2023	Common Sense Reasoning, Contrastive Learning	Unverified
UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese	May 7, 2023	Image Captioning, Vietnamese Image Captioning	Unverified
A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding	May 5, 2023	Articles, Image Captioning	Code Available
The Role of Data Curation in Image Captioning	May 5, 2023	Few-Shot Learning, Image Captioning	Code Available
Image Captioners Sometimes Tell More Than Images They See	May 4, 2023	Descriptive, Image Captioning	Unverified
Multimodal Data Augmentation for Image Captioning using Diffusion Models	May 3, 2023	Data Augmentation, Image Captioning	Code Available
Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime	May 3, 2023	Image Captioning, Question Answering	Unverified
Fairness in AI Systems: Mitigating gender bias from language-vision models	May 3, 2023	Fairness, Image Captioning	Unverified
Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment	Apr 28, 2023	Data Augmentation, Image Captioning	Unverified
Learning Human-Human Interactions in Images from Weak Textual Supervision	Apr 27, 2023	Human-Human Interaction Recognition, Image Captioning	Unverified
TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models	Apr 18, 2023	Data Augmentation, Diversity	Code Available
A-CAP: Anticipation Captioning with Commonsense Knowledge	Apr 13, 2023	Image Captioning, Language Modeling	Unverified
Boosting Cross-task Transferability of Adversarial Patches with Visual Relations	Apr 11, 2023	Image Captioning, Object Recognition	Unverified
Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT	Apr 11, 2023	Diagnostic, Image Captioning	Unverified
ImageCaptioner^2: Image Captioner for Image Captioning Bias Amplification Assessment	Apr 10, 2023	Image Captioning	Unverified
Model-Agnostic Gender Debiased Image Captioning	Apr 7, 2023	Image Captioning, model	Code Available
Towards Self-Explainability of Deep Neural Networks with Heatmap Captioning and Large-Language Models	Apr 5, 2023	Explainable Artificial Intelligence (XAI), Image Captioning	Unverified
Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data	Apr 4, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
Cross-Domain Image Captioning with Discriminative Finetuning	Apr 4, 2023	Descriptive, Image Captioning	Unverified
Grand Challenge On Detecting Cheapfakes	Apr 3, 2023	Image Captioning	Code Available
Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations	Mar 29, 2023	Image Captioning, Instance Segmentation	Unverified
Variational Distribution Learning for Unsupervised Text-to-Image Generation	Mar 28, 2023	Image Captioning, Image Generation	Unverified
Open-Vocabulary Object Detection using Pseudo Caption Labels	Mar 23, 2023	Image Captioning, Knowledge Distillation	Unverified
Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings	Mar 20, 2023	Image Captioning, Retrieval	Unverified
Multi-modal reward for visual relationships-based image captioning	Mar 19, 2023	Caption Generation, Deep Reinforcement Learning	Unverified
Visual Information Matters for ASR Error Correction	Mar 16, 2023	Automatic Speech Recognition, Automatic Speech Recognition (ASR)	Unverified
PR-MCS: Perturbation Robust Metric for MultiLingual Image Captioning	Mar 15, 2023	Image Captioning	Unverified
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images	Mar 13, 2023	Common Sense Reasoning, Explanation Generation	Unverified
Learning Combinatorial Prompts for Universal Controllable Image Captioning	Mar 11, 2023	controllable image captioning, Image Captioning	Unverified
Adapting Contrastive Language-Image Pretrained (CLIP) Models for Out-of-Distribution Detection	Mar 10, 2023	Anomaly Detection, Image Captioning	Code Available
Interpretable Visual Question Answering Referring to Outside Knowledge	Mar 8, 2023	Diversity, Image Captioning	Unverified
Graph Neural Networks in Vision-Language Image Understanding: A Survey	Mar 7, 2023	Image Captioning, Image Retrieval	Unverified
Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning	Mar 5, 2023	Image Captioning	Unverified
Language Is Not All You Need: Aligning Perception with Language Models	Feb 27, 2023	All, Image Captioning	Unverified
Tuning computer vision models with task rewards	Feb 16, 2023	Colorization, Image Captioning	Unverified
See Your Heart: Psychological states Interpretation through Visual Creations	Feb 11, 2023	Emotion Classification, Image Captioning	Unverified
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning	Feb 9, 2023	Few-Shot Learning, Image Captioning	Unverified
Nemesis: Neural Mean Teacher Learning-Based Emotion-Centric Speaker	Feb 9, 2023	Image Captioning	Unverified
Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning	Feb 8, 2023	Caption Generation, Decoder	Unverified
KENGIC: KEyword-driven and N-Gram Graph based Image Captioning	Feb 7, 2023	Image Captioning	Unverified
Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning	Feb 4, 2023	Caption Generation, Coherence Evaluation	Code Available
DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps	Feb 3, 2023	Image Captioning, Optical Character Recognition (OCR)	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified