SOTAVerified

Image Captioning

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
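
The encoder-decoder framework described above is easiest to see in code. The sketch below is a minimal illustration, assuming PyTorch and torchvision are installed; the class names, the ResNet-50 encoder, the LSTM decoder, and all dimensions are hypothetical choices for demonstration, not the method of any paper listed on this page.

```python
# Minimal encoder-decoder captioning sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models


class CaptionEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a CNN backbone."""

    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pretrained weights optional (torchvision >= 0.13)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.proj = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):                 # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)    # (B, 2048)
        return self.proj(feats)                # (B, embed_dim)


class CaptionDecoder(nn.Module):
    """Decode image features into a token sequence with an LSTM."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):    # captions: (B, T) token ids
        # Prepend the image feature as the first "token" of the sequence.
        tokens = self.embed(captions)                           # (B, T, E)
        inputs = torch.cat([img_feats.unsqueeze(1), tokens], 1) # (B, T+1, E)
        hidden, _ = self.lstm(inputs)                           # (B, T+1, H)
        return self.head(hidden)                                # logits over vocabulary


if __name__ == "__main__":
    encoder, decoder = CaptionEncoder(), CaptionDecoder(vocab_size=10000)
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 10000, (2, 12))
    logits = decoder(encoder(images), captions)
    print(logits.shape)  # torch.Size([2, 13, 10000])
```

In practice the decoder is trained with teacher forcing against reference captions and run greedily or with beam search at inference time; many recent systems replace the CNN/LSTM pair with a vision transformer encoder and a transformer or LLM decoder.
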

Papers

Showing 101–150 of 1878 papers

Title | Status | Hype
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting | | 0
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness | | 0
InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models | | 0
A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models | | 0
Pretrained Image-Text Models are Secretly Video Captioners | Code | 0
GroundCap: A Visually Grounded Image Captioning Dataset | | 0
TPCap: Unlocking Zero-Shot Image Captioning with Trigger-Augmented and Multi-Modal Purification Modules | | 0
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | | 0
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis | Code | 1
FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning | | 0
Vision-Language Models for Edge Networks: A Comprehensive Survey | | 0
Evaluation of Multilingual Image Captioning: How far can we get with CLIP models? | Code | 0
Generative Distribution Prediction: A Unified Approach to Multimodal Learning | | 0
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding | Code | 3
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents | | 0
Efficient Few-Shot Continual Learning in Vision-Language Models | | 0
TexLiDAR: Automated Text Understanding for Panoramic LiDAR Data | Code | 0
Exploring Spatial Language Grounding Through Referring Expressions | | 0
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation | | 0
Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Code | 1
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0
Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes | Code | 0
An Ensemble Model with Attention Based Mechanism for Image Captioning | | 0
PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model | Code | 1
Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis | Code | 0
LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport | Code | 1
Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness | | 0
VCRScore: Image captioning metric based on V&L Transformers, CLIP, and precision-recall | | 0
RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment | Code | 1
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | | 0
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design | Code | 3
Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time | | 0
Evaluating Image Caption via Cycle-consistent Text-to-Image Generation | | 0
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? | Code | 1
Decoding fMRI Data into Captions using Prefix Language Modeling | Code | 0
MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning | | 0
Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception | | 0
AdaDARE-gamma: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation | | 0
Diffusion Bridge: Leveraging Diffusion Model to Reduce the Modality Gap Between Text and Vision for Zero-Shot Image Captioning | Code | 1
Semantic and Expressive Variations in Image Captions Across Languages | | 0
Variance-Based Membership Inference Attacks Against Large-Scale Image Captioning Models | | 0
Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution | | 0
Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning | | 0
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering | | 0
ErgoChat: a Visual Query System for the Ergonomic Risk Assessment of Construction Workers | | 0
ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning | Code | 0
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation | Code | 2
Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | | 0
GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning | | 0
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization | Code | 0
Page 3 of 38

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IBM Research AI | CIDEr | 80.67 | | Unverified
2 | CASIA_IVA | CIDEr | 79.15 | | Unverified
3 | feixiang | CIDEr | 77.31 | | Unverified
4 | wocao | CIDEr | 77.21 | | Unverified
5 | lamiwab172 | CIDEr | 75.93 | | Unverified
6 | RUC_AIM3 | CIDEr | 73.52 | | Unverified
7 | funas | CIDEr | 73.51 | | Unverified
8 | SRC-B_VCLab | CIDEr | 73.47 | | Unverified
9 | sparta | CIDEr | 73.41 | | Unverified
10 | x-viz | CIDEr | 73.26 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VALOR | CIDEr | 152.5 | | Unverified
2 | VAST | CIDEr | 149 | | Unverified
3 | Virtex (ResNet-101) | CIDEr | 94 | | Unverified
4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | | Unverified
5 | BLIP-FuseCap | CLIPScore | 78.5 | | Unverified
6 | mPLUG | BLEU-4 | 46.5 | | Unverified
7 | OFA | BLEU-4 | 44.9 | | Unverified
8 | GIT | BLEU-4 | 44.1 | | Unverified
9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | | Unverified
10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 149.1 | | Unverified
2 | GIT2, Single Model | CIDEr | 124.18 | | Unverified
3 | GIT, Single Model | CIDEr | 122.4 | | Unverified
4 | PaLI | CIDEr | 121.09 | | Unverified
5 | CoCa - Google Brain | CIDEr | 117.9 | | Unverified
6 | Microsoft Cognitive Services team | CIDEr | 112.82 | | Unverified
7 | Single Model | CIDEr | 108.98 | | Unverified
8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | | Unverified
9 | FudanFVL | CIDEr | 104.9 | | Unverified
10 | FudanWYZ | CIDEr | 104.25 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GIT2, Single Model | CIDEr | 125.51 | | Unverified
2 | PaLI | CIDEr | 124.35 | | Unverified
3 | GIT, Single Model | CIDEr | 123.92 | | Unverified
4 | CoCa - Google Brain | CIDEr | 120.73 | | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 115.54 | | Unverified
6 | Single Model | CIDEr | 110.76 | | Unverified
7 | FudanFVL | CIDEr | 109.33 | | Unverified
8 | FudanWYZ | CIDEr | 108.04 | | Unverified
9 | IEDA-LAB | CIDEr | 100.15 | | Unverified
10 | firethehole | CIDEr | 99.51 | | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 126.67 | | Unverified
2 | GIT2, Single Model | CIDEr | 122.27 | | Unverified
3 | GIT, Single Model | CIDEr | 122.04 | | Unverified
4 | CoCa - Google Brain | CIDEr | 121.69 | | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 110.14 | | Unverified
6 | Single Model | CIDEr | 109.49 | | Unverified
7 | FudanFVL | CIDEr | 106.55 | | Unverified
8 | FudanWYZ | CIDEr | 103.75 | | Unverified
9 | Human | CIDEr | 91.62 | | Unverified
10 | firethehole | CIDEr | 88.54 | | Unverified
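
For context on the numbers above: BLEU and CIDEr compare a generated caption against several human reference captions, and leaderboards conventionally report the score multiplied by 100. The snippet below is a minimal sketch of a BLEU-4 computation, assuming NLTK is installed; the captions are invented examples, and this is only an illustration of the metric, not how the scores in these tables were produced (CIDEr in particular is usually computed with the COCO caption evaluation toolkit, pycocoevalcap, rather than by hand).

```python
# Illustrative BLEU-4 computation on made-up captions using NLTK.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One hypothesis caption per image, each paired with its list of reference captions.
references = [
    [["a", "dog", "runs", "along", "the", "beach"],
     ["a", "brown", "dog", "running", "on", "sand"]],
]
hypotheses = [
    ["a", "dog", "is", "running", "on", "the", "beach"],
]

# BLEU-4: equal weights over 1- to 4-gram precisions, with smoothing so a
# single short example does not collapse to zero.
bleu4 = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {100 * bleu4:.1f}")  # scaled by 100, matching leaderboard convention
```
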