SOTAVerified

Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
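BLEU, for instance, scores a candidate caption by its n-gram precision against reference captions. The sketch below is a toy, single-reference version for illustration only (function names are ours; the official evaluators pool statistics across the whole test set and add smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU-4 against a single reference caption."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        overlap = sum((cand_ngrams & ngrams(ref, n)).values())
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / sum(cand_ngrams.values())))
    # brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0; partial overlap lands strictly between 0 and 1.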

(Image credit: Reflective Decoding Network for Image Captioning, ICCV 2019)

Papers

Showing 751–800 of 1878 papers

Title | Status | Hype
GPTs Are Multilingual Annotators for Sequence Generation Tasks | Code | 0
Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing | | 0
CIC: A Framework for Culturally-Aware Image Captioning | | 0
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images | Code | 0
Image captioning for Brazilian Portuguese using GRIT model | | 0
Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models? | | 0
PICS: Pipeline for Image Captioning and Search | | 0
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling | | 0
Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Code | 0
COCO is "ALL" You Need for Visual Instruction Fine-tuning | | 0
KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain | | 0
Jewelry Recognition via Encoder-Decoder Models | | 0
What Else Would I Like? A User Simulator using Alternatives for Improved Evaluation of Fashion Conversational Recommendation Systems | | 0
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding | | 0
MAMI: Multi-Attentional Mutual-Information for Long Sequence Neuron Captioning | | 0
Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding | Code | 0
Object-oriented backdoor attack against image captioning | | 0
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment | | 0
Social Media Ready Caption Generation for Brands | | 0
Cycle-Consistency Learning for Captioning and Grounding | | 0
LLM4VG: Large Language Models Evaluation for Video Grounding | | 0
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models | Code | 0
Dietary Assessment with Multimodal ChatGPT: A Systematic Analysis | | 0
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning | | 0
Synocene, Beyond the Anthropocene: De-Anthropocentralising Human-Nature-AI Interaction | | 0
Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data | | 0
Unifying Text, Tables, and Images for Multimodal Question Answering | Code | 0
PixLore: A Dataset-driven Approach to Rich Image Captioning | Code | 0
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects | | 0
User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning | | 0
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks | | 0
Towards More Unified In-context Visual Understanding | | 0
CLAMP: Contrastive LAnguage Model Prompt-tuning | | 0
Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT | Code | 0
Video Summarization: Towards Entity-Aware Captions | Code | 0
Enhancing Image Captioning with Neural Models | | 0
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts | | 0
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation | Code | 0
A natural language processing-based approach: mapping human perception by understanding deep semantic features in street view images | | 0
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training | | 0
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension | | 0
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism | | 0
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder | | 0
Improving Image Captioning via Predicting Structured Concepts | | 0
Holistic Evaluation of GPT-4V for Biomedical Imaging | | 0
How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model | | 0
Zero-shot Translation of Attention Patterns in VQA Models to Natural Language | Code | 0
DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding | Code | 0
JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models | Code | 0
Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning | | 0
Page 16 of 38

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IBM Research AI | CIDEr | 80.67 | | Unverified
2 | CASIA_IVA | CIDEr | 79.15 | | Unverified
3 | feixiang | CIDEr | 77.31 | | Unverified
4 | wocao | CIDEr | 77.21 | | Unverified
5 | lamiwab172 | CIDEr | 75.93 | | Unverified
6 | RUC_AIM3 | CIDEr | 73.52 | | Unverified
7 | funas | CIDEr | 73.51 | | Unverified
8 | SRC-B_VCLab | CIDEr | 73.47 | | Unverified
9 | sparta | CIDEr | 73.41 | | Unverified
10 | x-viz | CIDEr | 73.26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VALOR | CIDEr | 152.5 | | Unverified
2 | VAST | CIDEr | 149 | | Unverified
3 | VirTex (ResNet-101) | CIDEr | 94 | | Unverified
4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | | Unverified
5 | BLIP-FuseCap | CLIPScore | 78.5 | | Unverified
6 | mPLUG | BLEU-4 | 46.5 | | Unverified
7 | OFA | BLEU-4 | 44.9 | | Unverified
8 | GIT | BLEU-4 | 44.1 | | Unverified
9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | | Unverified
10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 149.1 | | Unverified
2 | GIT2, Single Model | CIDEr | 124.18 | | Unverified
3 | GIT, Single Model | CIDEr | 122.4 | | Unverified
4 | PaLI | CIDEr | 121.09 | | Unverified
5 | CoCa - Google Brain | CIDEr | 117.9 | | Unverified
6 | Microsoft Cognitive Services team | CIDEr | 112.82 | | Unverified
7 | Single Model | CIDEr | 108.98 | | Unverified
8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | | Unverified
9 | FudanFVL | CIDEr | 104.9 | | Unverified
10 | FudanWYZ | CIDEr | 104.25 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GIT2, Single Model | CIDEr | 125.51 | | Unverified
2 | PaLI | CIDEr | 124.35 | | Unverified
3 | GIT, Single Model | CIDEr | 123.92 | | Unverified
4 | CoCa - Google Brain | CIDEr | 120.73 | | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 115.54 | | Unverified
6 | Single Model | CIDEr | 110.76 | | Unverified
7 | FudanFVL | CIDEr | 109.33 | | Unverified
8 | FudanWYZ | CIDEr | 108.04 | | Unverified
9 | IEDA-LAB | CIDEr | 100.15 | | Unverified
10 | firethehole | CIDEr | 99.51 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 126.67 | | Unverified
2 | GIT2, Single Model | CIDEr | 122.27 | | Unverified
3 | GIT, Single Model | CIDEr | 122.04 | | Unverified
4 | CoCa - Google Brain | CIDEr | 121.69 | | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 110.14 | | Unverified
6 | Single Model | CIDEr | 109.49 | | Unverified
7 | FudanFVL | CIDEr | 106.55 | | Unverified
8 | FudanWYZ | CIDEr | 103.75 | | Unverified
9 | Human | CIDEr | 91.62 | | Unverified
10 | firethehole | CIDEr | 88.54 | | Unverified
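The CIDEr metric reported throughout these tables weights n-gram matches by TF-IDF, so that n-grams common to many images count for less, and takes a cosine similarity against the reference captions. A toy, simplified sketch (function names are ours; the official metric also clips counts and applies a length-based Gaussian penalty in the CIDEr-D variant):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tfidf(counts, idf):
    """Term-frequency times inverse-document-frequency weights."""
    total = sum(counts.values()) or 1
    return {g: (c / total) * idf.get(g, 0.0) for g, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider(candidate, references, corpus, max_n=4):
    """Toy CIDEr: TF-IDF-weighted n-gram cosine, averaged over n and over
    references, scaled by 10 as in the original metric. `corpus` is a list
    of per-image reference lists used to estimate document frequencies."""
    score = 0.0
    for n in range(1, max_n + 1):
        # document frequency: in how many images' reference sets each n-gram appears
        df = Counter()
        for refs in corpus:
            seen = set()
            for r in refs:
                seen |= set(ngram_counts(r.split(), n))
            df.update(seen)
        idf = {g: math.log(len(corpus) / d) for g, d in df.items()}
        cand_vec = tfidf(ngram_counts(candidate.split(), n), idf)
        sims = [cosine(cand_vec, tfidf(ngram_counts(r.split(), n), idf))
                for r in references]
        score += sum(sims) / len(sims)
    return 10.0 * score / max_n
```

The factor of 10 explains why leaderboard CIDEr values cluster around 100 rather than 1: a caption identical to its reference scores 10 per image, i.e. roughly 100 under the conventional percentage-style reporting of other metrics.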