Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
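To make the BLEU metric mentioned above more concrete, here is a minimal sketch of a sentence-level BLEU score in Python: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The function names (`ngrams`, `clipped_precision`, `brevity_penalty`, `sentence_bleu`) are illustrative choices, not part of any benchmark toolkit. The sketch is unsmoothed and returns a value between 0 and 1, whereas leaderboards usually report corpus-level scores scaled by 100, so its output is not directly comparable to published numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that appear in a reference, where each
    n-gram is credited at most as often as it occurs in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates that are shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are broken toward the shorter reference.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1] (assumption: uniform weights)."""
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_avg)


if __name__ == "__main__":
    refs = ["a dog runs across the grass".split(), "a dog is running on grass".split()]
    print(sentence_bleu("a dog runs on the grass".split(), refs, max_n=2))
```

Because a single missing n-gram order zeroes the unsmoothed score, practical evaluations usually aggregate counts over the whole test set before computing the precisions.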

Papers

Showing 451–500 of 1878 papers

Title	Date	Tasks	Status	Hype
Jewelry Recognition via Encoder-Decoder Models	Jan 15, 2024	Decoder, Image Captioning	Unverified	0
What Else Would I Like? A User Simulator using Alternatives for Improved Evaluation of Fashion Conversational Recommendation Systems	Jan 11, 2024	Conversational Recommendation, Image Captioning	Unverified	0
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding	Jan 9, 2024	Image Captioning, image-classification	Unverified	0
MAMI: Multi-Attentional Mutual-Information for Long Sequence Neuron Captioning	Jan 5, 2024	Decoder, Image Captioning	Unverified	0
Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding	Jan 5, 2024	Image Captioning, Machine Translation	Code Available	0
Object-oriented backdoor attack against image captioning	Jan 5, 2024	Backdoor Attack, Image Captioning	Unverified	0
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment	Jan 4, 2024	Image Captioning, image-classification	Unverified	0
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training	Jan 4, 2024	Descriptive, Image Captioning	Code Available	1
Social Media Ready Caption Generation for Brands	Jan 3, 2024	Caption Generation, Image Captioning	Unverified	0
GPT-4V(ision) is a Generalist Web Agent, if Grounded	Jan 3, 2024	Image Captioning, Question Answering	Code Available	4
Learning Vision from Models Rivals Learning Vision from Data	Dec 28, 2023	Contrastive Learning, Image Captioning	Code Available	2
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones	Dec 28, 2023	Computational Efficiency, Image Captioning	Code Available	3
Cycle-Consistency Learning for Captioning and Grounding	Dec 23, 2023	Image Captioning, Visual Grounding	Unverified	0
LLM4VG: Large Language Models Evaluation for Video Grounding	Dec 21, 2023	Image Captioning, Video Grounding	Unverified	0
VCoder: Versatile Vision Encoders for Multimodal Large Language Models	Dec 21, 2023	Image Captioning, Image Generation	Code Available	2
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models	Dec 17, 2023	Image Captioning, Question Answering	Code Available	0
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models	Dec 15, 2023	Image Captioning, In-Context Learning	Code Available	1
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning	Dec 15, 2023	Factual Inconsistency Detection in Chart Captioning, Image Captioning	Code Available	1
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation	Dec 14, 2023	Image Captioning, Image Generation	Code Available	1
Dietary Assessment with Multimodal ChatGPT: A Systematic Analysis	Dec 14, 2023	Image Captioning, Scene Understanding	Unverified	0
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning	Dec 14, 2023	cross-modal alignment, Decoder	Unverified	0
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions	Dec 14, 2023	Image Captioning	Code Available	1
Synocene, Beyond the Anthropocene: De-Anthropocentralising Human-Nature-AI Interaction	Dec 13, 2023	Chatbot, Image Captioning	Unverified	0
Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data	Dec 11, 2023	Image Captioning, Image-text Retrieval	Unverified	0
Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator	Dec 11, 2023	Image Captioning, Question Answering	Code Available	1
Unifying Text, Tables, and Images for Multimodal Question Answering	Dec 10, 2023	Image Captioning, Question Answering	Code Available	0
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects	Dec 8, 2023	Image Captioning, object-detection	Unverified	0
User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning	Dec 8, 2023	Image Captioning, Language Modeling	Unverified	0
PixLore: A Dataset-driven Approach to Rich Image Captioning	Dec 8, 2023	GPU, Image Captioning	Code Available	0
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos	Dec 7, 2023	Diagnostic, Image Captioning	Code Available	1
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks	Dec 6, 2023	Image Captioning, image-classification	Unverified	0
Mitigating Open-Vocabulary Caption Hallucinations	Dec 6, 2023	Diversity, Hallucination	Code Available	1
Towards More Unified In-context Visual Understanding	Dec 5, 2023	Decoder, Image Captioning	Unverified	0
CLAMP: Contrastive LAnguage Model Prompt-tuning	Dec 4, 2023	Contrastive Learning, Image Captioning	Unverified	0
Automatic Report Generation for Histopathology images using pre-trained Vision Transformers and BERT	Dec 3, 2023	Caption Generation, Decoder	Code Available	0
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning	Dec 2, 2023	Causal Language Modeling, Contrastive Learning	Code Available	1
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback	Dec 1, 2023	Hallucination, Image Captioning	Code Available	6
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts	Dec 1, 2023	Chart Question Answering, Document AI	Unverified	0
Video Summarization: Towards Entity-Aware Captions	Dec 1, 2023	Image Captioning, Video Captioning	Code Available	0
Enhancing Image Captioning with Neural Models	Dec 1, 2023	Caption Generation, Image Captioning	Unverified	0
InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation	Nov 30, 2023	Image Captioning, Referring Expression	Code Available	0
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner	Nov 29, 2023	Contrastive Learning, Image Captioning	Code Available	1
A natural language processing-based approach: mapping human perception by understanding deep semantic features in street view images	Nov 29, 2023	Image Captioning, Language Modelling	Unverified	0
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models	Nov 28, 2023	Image Captioning, Image-text matching	Code Available	1
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	Nov 28, 2023	Image Captioning, Question Answering	Code Available	2
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training	Nov 28, 2023	Image Captioning, Transfer Learning	Unverified	0
EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension	Nov 27, 2023	Image Captioning, Object	Unverified	0
DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism	Nov 25, 2023	Caption Generation, Denoising	Unverified	0
Violet: A Vision-Language Model for Arabic Image Captioning with Gemini Decoder	Nov 15, 2023	Decoder, Image Captioning	Unverified	0
Improving Image Captioning via Predicting Structured Concepts	Nov 14, 2023	Image Captioning	Unverified	0

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified