SOTAVerified

Image Captioning

Image captioning is the task of describing the content of an image in natural language. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with n-gram-overlap metrics such as BLEU and CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
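To make the evaluation metrics above concrete, here is a minimal sketch of sentence-level BLEU in pure Python. It is a simplification of the real metric (one reference instead of several, uniform n-gram weights, naive smoothing); CIDEr differs mainly in applying TF-IDF weighting to the n-gram counts.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precision up to
    max_n, geometric mean, and a brevity penalty. Assumes a single
    reference and whitespace tokenization."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Epsilon smoothing so a zero precision does not zero the product.
        precisions.append(max(clipped, 1e-9) / total)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match scores 1.0; partial overlap falls between 0 and 1, dropping sharply once higher-order n-grams stop matching.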

Papers

Showing 151–200 of 1878 papers

| Title | Status | Hype |
|---|---|---|
| A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation | Code | 0 |
| Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution | Code | 0 |
| Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage | | 0 |
| Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation | Code | 0 |
| Dataset Augmentation by Mixing Visual Concepts | | 0 |
| Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models | Code | 0 |
| Flowing from Words to Pixels: A Framework for Cross-Modality Evolution | | 0 |
| Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception | Code | 0 |
| G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o | Code | 1 |
| JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts | Code | 0 |
| Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models | Code | 1 |
| CroQS: Cross-modal Query Suggestion for Text-to-Image Retrieval | | 0 |
| MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants | Code | 1 |
| PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension | | 0 |
| UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer | | 0 |
| Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track | | 0 |
| From Simple to Professional: A Combinatorial Controllable Image Captioning Agent | Code | 0 |
| Optimizing Vision-Language Interactions Through Decoder-Only Models | | 0 |
| Automated Image Captioning with CNNs and Transformers | Code | 0 |
| Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals | | 0 |
| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Code | 1 |
| Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models | | 0 |
| How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey | | 0 |
| 3D Spatial Understanding in MLLMs: Disambiguation and Evaluation | | 0 |
| Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models | Code | 0 |
| HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing | | 0 |
| Automated Medical Report Generation for ECG Data: Bridging Medical Text and Signal Processing with Deep Learning | Code | 0 |
| Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Code | 3 |
| Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis | | 0 |
| Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey | Code | 3 |
| Progress-Aware Video Frame Captioning | | 0 |
| CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs | | 0 |
| DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding | | 0 |
| Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring | | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | | 0 |
| OPCap: Object-aware Prompting Captioning | | 0 |
| Active Data Curation Effectively Distills Large-Scale Multimodal Models | | 0 |
| Efficient Multi-modal Large Language Models via Visual Token Grouping | | 0 |
| LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation | Code | 1 |
| Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models | | 0 |
| Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks | | 0 |
| FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity | | 0 |
| FG-CXR: A Radiologist-Aligned Gaze Dataset for Enhancing Interpretability in Chest X-Ray Report Generation | Code | 1 |
| Uterine Ultrasound Image Captioning Using Deep Learning Techniques | | 0 |
| LMM-driven Semantic Image-Text Coding for Ultra Low-bitrate Learned Image Compression | Code | 1 |
| Mitigating Perception Bias: A Training-Free Approach to Enhance LMM for Image Quality Assessment | | 0 |
| AI Flow at the Network Edge | | 0 |
| The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning | Code | 0 |
| Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning | Code | 0 |
| MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | IBM Research AI | CIDEr | 80.67 | | Unverified |
| 2 | CASIA_IVA | CIDEr | 79.15 | | Unverified |
| 3 | feixiang | CIDEr | 77.31 | | Unverified |
| 4 | wocao | CIDEr | 77.21 | | Unverified |
| 5 | lamiwab172 | CIDEr | 75.93 | | Unverified |
| 6 | RUC_AIM3 | CIDEr | 73.52 | | Unverified |
| 7 | funas | CIDEr | 73.51 | | Unverified |
| 8 | SRC-B_VCLab | CIDEr | 73.47 | | Unverified |
| 9 | sparta | CIDEr | 73.41 | | Unverified |
| 10 | x-viz | CIDEr | 73.26 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VALOR | CIDEr | 152.5 | | Unverified |
| 2 | VAST | CIDEr | 149 | | Unverified |
| 3 | Virtex (ResNet-101) | CIDEr | 94 | | Unverified |
| 4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | | Unverified |
| 5 | BLIP-FuseCap | CLIPScore | 78.5 | | Unverified |
| 6 | mPLUG | BLEU-4 | 46.5 | | Unverified |
| 7 | OFA | BLEU-4 | 44.9 | | Unverified |
| 8 | GIT | BLEU-4 | 44.1 | | Unverified |
| 9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | | Unverified |
| 10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLI | CIDEr | 149.1 | | Unverified |
| 2 | GIT2, Single Model | CIDEr | 124.18 | | Unverified |
| 3 | GIT, Single Model | CIDEr | 122.4 | | Unverified |
| 4 | PaLI | CIDEr | 121.09 | | Unverified |
| 5 | CoCa - Google Brain | CIDEr | 117.9 | | Unverified |
| 6 | Microsoft Cognitive Services team | CIDEr | 112.82 | | Unverified |
| 7 | Single Model | CIDEr | 108.98 | | Unverified |
| 8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | | Unverified |
| 9 | FudanFVL | CIDEr | 104.9 | | Unverified |
| 10 | FudanWYZ | CIDEr | 104.25 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GIT2, Single Model | CIDEr | 125.51 | | Unverified |
| 2 | PaLI | CIDEr | 124.35 | | Unverified |
| 3 | GIT, Single Model | CIDEr | 123.92 | | Unverified |
| 4 | CoCa - Google Brain | CIDEr | 120.73 | | Unverified |
| 5 | Microsoft Cognitive Services team | CIDEr | 115.54 | | Unverified |
| 6 | Single Model | CIDEr | 110.76 | | Unverified |
| 7 | FudanFVL | CIDEr | 109.33 | | Unverified |
| 8 | FudanWYZ | CIDEr | 108.04 | | Unverified |
| 9 | IEDA-LAB | CIDEr | 100.15 | | Unverified |
| 10 | firethehole | CIDEr | 99.51 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLI | CIDEr | 126.67 | | Unverified |
| 2 | GIT2, Single Model | CIDEr | 122.27 | | Unverified |
| 3 | GIT, Single Model | CIDEr | 122.04 | | Unverified |
| 4 | CoCa - Google Brain | CIDEr | 121.69 | | Unverified |
| 5 | Microsoft Cognitive Services team | CIDEr | 110.14 | | Unverified |
| 6 | Single Model | CIDEr | 109.49 | | Unverified |
| 7 | FudanFVL | CIDEr | 106.55 | | Unverified |
| 8 | FudanWYZ | CIDEr | 103.75 | | Unverified |
| 9 | Human | CIDEr | 91.62 | | Unverified |
| 10 | firethehole | CIDEr | 88.54 | | Unverified |