SOTAVerified

Image Captioning

Image Captioning is the task of describing the content of an image in natural language. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
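To make the evaluation side concrete, here is a minimal pure-Python sketch of sentence-level BLEU: clipped n-gram precision for n = 1..4, combined with a brevity penalty. This is a simplified illustration, not the official implementation — real evaluations use corpus-level BLEU with smoothing (e.g. the COCO caption evaluation toolkit), and the function names here are the author's own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty. Illustrative only."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        precisions.append(clipped / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A caption identical to a reference scores 1.0; a caption sharing no n-grams with any reference scores 0.0. CIDEr differs mainly in weighting n-grams by TF-IDF across the reference corpus, so that generic phrases count for less than content-bearing ones.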

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers

Showing 201–250 of 1878 papers

Title | Status | Hype
Cross-Modal Consistency in Multimodal Large Language Models | — | 0
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions | Code | 0
Grounded Video Caption Generation | — | 0
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions | — | 0
ViTOC: Vision Transformer and Object-aware Captioner | — | 0
Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models | — | 0
Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model | Code | 0
Seeing is Deceiving: Exploitation of Visual Pathways in Multi-Modal Language Models | — | 0
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation | Code | 4
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering | — | 0
Designing a Robust Radiology Report Generation System | — | 0
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP | — | 0
Nearest Neighbor Normalization Improves Multimodal Retrieval | Code | 1
Large Language Model Benchmarks in Medical Tasks | — | 0
Image Generation from Image Captioning -- Invertible Approach | — | 0
Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Biases and Representations Using Natural Language Prompts | — | 0
Backdoor in Seconds: Unlocking Vulnerabilities in Large Pre-trained Models via Model Editing | — | 0
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning | Code | 1
Altogether: Image Captioning via Re-aligning Alt-text | Code | 0
Frontiers in Intelligent Colonoscopy | Code | 2
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use | — | 0
TIPS: Text-Image Pretraining with Spatial Awareness | Code | 2
MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images | Code | 0
An Efficient System for Automatic Map Storytelling -- A Case Study on Historical Maps | Code | 0
RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models | Code | 2
Hiding-in-Plain-Sight (HiPS) Attack on CLIP for Targetted Object Removal from Images | — | 0
Self-adaptive Multimodal Retrieval-Augmented Generation | Code | 0
MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages | — | 0
CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification | — | 0
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks | Code | 0
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment | Code | 0
Core Tokensets for Data-efficient Sequential Training of Transformers | Code | 0
AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models | — | 0
CAPEEN: Image Captioning with Early Exits and Knowledge Distillation | Code | 0
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | — | 0
Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval | — | 0
Backdooring Vision-Language Models with Out-Of-Distribution Data | — | 0
TROPE: TRaining-Free Object-Part Enhancement for Seamlessly Improving Fine-Grained Zero-Shot Image Captioning | Code | 0
TrojVLM: Backdoor Attack Against Vision Language Models | — | 0
DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning | — | 0
Enhancing Explainability in Multimodal Large Language Models Using Ontological Context | — | 0
A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning | — | 0
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning | Code | 1
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models | Code | 4
Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning | — | 0
Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization | Code | 0
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology | — | 0
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs | — | 0
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models | Code | 1
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model | Code | 1
Page 5 of 38

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IBM Research AI | CIDEr | 80.67 | — | Unverified
2 | CASIA_IVA | CIDEr | 79.15 | — | Unverified
3 | feixiang | CIDEr | 77.31 | — | Unverified
4 | wocao | CIDEr | 77.21 | — | Unverified
5 | lamiwab172 | CIDEr | 75.93 | — | Unverified
6 | RUC_AIM3 | CIDEr | 73.52 | — | Unverified
7 | funas | CIDEr | 73.51 | — | Unverified
8 | SRC-B_VCLab | CIDEr | 73.47 | — | Unverified
9 | sparta | CIDEr | 73.41 | — | Unverified
10 | x-viz | CIDEr | 73.26 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VALOR | CIDEr | 152.5 | — | Unverified
2 | VAST | CIDEr | 149 | — | Unverified
3 | Virtex (ResNet-101) | CIDEr | 94 | — | Unverified
4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | — | Unverified
5 | BLIP-FuseCap | CLIPScore | 78.5 | — | Unverified
6 | mPLUG | BLEU-4 | 46.5 | — | Unverified
7 | OFA | BLEU-4 | 44.9 | — | Unverified
8 | GIT | BLEU-4 | 44.1 | — | Unverified
9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | — | Unverified
10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 149.1 | — | Unverified
2 | GIT2, Single Model | CIDEr | 124.18 | — | Unverified
3 | GIT, Single Model | CIDEr | 122.4 | — | Unverified
4 | PaLI | CIDEr | 121.09 | — | Unverified
5 | CoCa - Google Brain | CIDEr | 117.9 | — | Unverified
6 | Microsoft Cognitive Services team | CIDEr | 112.82 | — | Unverified
7 | Single Model | CIDEr | 108.98 | — | Unverified
8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | — | Unverified
9 | FudanFVL | CIDEr | 104.9 | — | Unverified
10 | FudanWYZ | CIDEr | 104.25 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GIT2, Single Model | CIDEr | 125.51 | — | Unverified
2 | PaLI | CIDEr | 124.35 | — | Unverified
3 | GIT, Single Model | CIDEr | 123.92 | — | Unverified
4 | CoCa - Google Brain | CIDEr | 120.73 | — | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 115.54 | — | Unverified
6 | Single Model | CIDEr | 110.76 | — | Unverified
7 | FudanFVL | CIDEr | 109.33 | — | Unverified
8 | FudanWYZ | CIDEr | 108.04 | — | Unverified
9 | IEDA-LAB | CIDEr | 100.15 | — | Unverified
10 | firethehole | CIDEr | 99.51 | — | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 126.67 | — | Unverified
2 | GIT2, Single Model | CIDEr | 122.27 | — | Unverified
3 | GIT, Single Model | CIDEr | 122.04 | — | Unverified
4 | CoCa - Google Brain | CIDEr | 121.69 | — | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 110.14 | — | Unverified
6 | Single Model | CIDEr | 109.49 | — | Unverified
7 | FudanFVL | CIDEr | 106.55 | — | Unverified
8 | FudanWYZ | CIDEr | 103.75 | — | Unverified
9 | Human | CIDEr | 91.62 | — | Unverified
10 | firethehole | CIDEr | 88.54 | — | Unverified