Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
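To make the BLEU metric mentioned above more concrete, here is a minimal sketch of sentence-level BLEU: clipped n-gram precision combined with a brevity penalty. It assumes whitespace-tokenized captions and applies no smoothing. The function names (`ngrams`, `modified_precision`, `bleu`) are illustrative and not taken from any benchmark toolkit. Leaderboard figures are typically computed at corpus level and scaled by 100, so this sketch will not reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference caption."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1] (hypothetical helper)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    refs = ["a dog runs across the green grass".split(),
            "a dog is running on the grass".split()]
    caption = "a dog runs on the grass".split()
    print(f"BLEU-2: {bleu(caption, refs, max_n=2):.3f}")
```

Clipping stops a caption from scoring well by repeating a common word, and the brevity penalty stops very short captions from scoring well on precision alone. CIDEr instead weights n-grams by TF-IDF across the reference set, which is why its values are not confined to the same range as BLEU.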

Papers

Showing 701–750 of 1878 papers

Title	Date	Tasks	Status
On Speculative Decoding for Multimodal Large Language Models	Apr 13, 2024	Image Captioning, Language Modeling	Unverified
FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning	Apr 12, 2024	Federated Learning, Image Captioning	Code Available
Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation	Apr 6, 2024	Image Captioning, Instance Segmentation	Unverified
Would Deep Generative Models Amplify Bias in Future Models?	Apr 4, 2024	Image Captioning, Image Generation	Unverified
Jump Self-attention: Capturing High-order Statistics in Transformers	Apr 3, 2024	Image Captioning, Natural Language Understanding	Unverified
VLRM: Vision-Language Models act as Reward Models for Image Captioning	Apr 2, 2024	Image Captioning, reinforcement-learning	Unverified
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning	Apr 1, 2024	Image Captioning, Instruction Following	Code Available
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction	Apr 1, 2024	Image Captioning, Instruction Following	Unverified
Text Data-Centric Image Captioning with Interactive Prompts	Mar 28, 2024	Image Captioning	Unverified
LocCa: Visual Pretraining with Location-aware Captioners	Mar 28, 2024	Decoder, Image Captioning	Unverified
Semantic Map-based Generation of Navigation Instructions	Mar 28, 2024	Image Captioning	Code Available
A Review of Multi-Modal Large Language and Vision Models	Mar 28, 2024	Image Captioning, Prompt Engineering	Unverified
A Survey on Large Language Models from Concept to Implementation	Mar 27, 2024	Chatbot, Image Captioning	Unverified
Automated Report Generation for Lung Cytological Images Using a CNN Vision Classifier and Multiple-Transformer Text Decoders: Preliminary Study	Mar 26, 2024	Decoder, Image Captioning	Unverified
The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge	Mar 26, 2024	Caption Generation, Image Captioning	Unverified
Visual Hallucination: Definition, Quantification, and Prescriptive Remediations	Mar 26, 2024	Hallucination, Image Captioning	Unverified
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching	Mar 26, 2024	Data Augmentation, Graph Matching	Unverified
Image Captioning in news report scenario	Mar 24, 2024	Image Captioning, Recommendation Systems	Unverified
Cognitive resilience: Unraveling the proficiency of image-captioning models to interpret masked visual content	Mar 23, 2024	Descriptive, Image Captioning	Code Available
A Multimodal Approach for Cross-Domain Image Retrieval	Mar 22, 2024	Image Captioning, Image Retrieval	Unverified
MyVLM: Personalizing VLMs for User-Specific Queries	Mar 21, 2024	Image Captioning, Language Modelling	Unverified
Inserting Faces inside Captions: Image Captioning with Attention Guided Merging	Mar 20, 2024	Image Captioning, Retrieval	Unverified
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs	Mar 20, 2024	Audio captioning, Image Captioning	Unverified
As Firm As Their Foundations: Can open-sourced foundation models be used to create adversarial examples for downstream tasks?	Mar 19, 2024	Adversarial Attack, Image Captioning	Unverified
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition	Mar 19, 2024	Dense Captioning, Image Captioning	Unverified
Towards Multimodal In-Context Learning for Vision & Language Models	Mar 19, 2024	Image Captioning, In-Context Learning	Unverified
TARN-VIST: Topic Aware Reinforcement Network for Visual Storytelling	Mar 18, 2024	Image Captioning, Visual Storytelling	Unverified
Few-Shot VQA with Frozen LLMs: A Tale of Two Approaches	Mar 17, 2024	Image Captioning, Question Answering	Unverified
Does the Performance of Text-to-Image Retrieval Models Generalize Beyond Captions-as-a-Query?	Mar 15, 2024	Descriptive, Image Captioning	Code Available
Leveraging LLMs for On-the-Fly Instruction Guided Image Editing	Mar 12, 2024	Image Captioning	Code Available
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings	Mar 12, 2024	Image Captioning, Image Generation	Unverified
A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes	Mar 12, 2024	3D dense captioning, Dense Captioning	Unverified
Transformer based Multitask Learning for Image Captioning and Object Detection	Mar 10, 2024	Autonomous Navigation, Image Captioning	Unverified
The Case for Evaluating Multimodal Translation Models on Text Datasets	Mar 5, 2024	Descriptive, Image Captioning	Unverified
What Is Missing in Multilingual Visual Reasoning and How to Fix It	Mar 3, 2024	Image Captioning, Visual Reasoning	Code Available
Improving Explicit Spatial Relationships in Text-to-Image Generation through an Automatically Derived Dataset	Mar 1, 2024	Image Captioning, Image Generation	Code Available
EAMA : Entity-Aware Multimodal Alignment Based Approach for News Image Captioning	Feb 29, 2024	Image Captioning, Sentence	Unverified
Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction	Feb 28, 2024	Image Captioning, Language Modeling	Unverified
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks	Feb 27, 2024	Domain Generalization, Image Captioning	Unverified
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing	Feb 23, 2024	Image Captioning, Image Retrieval	Unverified
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions	Feb 20, 2024	Image Captioning, Question Answering	Unverified
IRR: Image Review Ranking Framework for Evaluating Vision-Language Models	Feb 19, 2024	Diversity, Image Captioning	Unverified
AICAttack: Adversarial Image Captioning Attack with Attention-Based Optimization	Feb 19, 2024	Adversarial Attack, Image Captioning	Code Available
Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models	Feb 19, 2024	Image Captioning, Question Answering	Unverified
Cobra Effect in Reference-Free Image Captioning Metrics	Feb 18, 2024	Image Captioning	Code Available
Learning How To Ask: Cycle-Consistency Refines Prompts in Multimodal Foundation Models	Feb 13, 2024	Code Generation, HumanEval	Unverified
Captions Are Worth a Thousand Words: Enhancing Product Retrieval with Pretrained Image-to-Text Models	Feb 13, 2024	Image Captioning, Image to text	Unverified
Multimodal Learned Sparse Retrieval for Image Suggestion	Feb 12, 2024	Image Captioning, Retrieval	Unverified
Consistency Model is an Effective Posterior Sample Approximation for Diffusion Inverse Solvers	Feb 9, 2024	Image Captioning, Semantic Segmentation	Unverified
Large Language Models for Captioning and Retrieving Remote Sensing Images	Feb 9, 2024	Cross-Modal Retrieval, Decoder	Unverified

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified