SOTAVerified

Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of the information it contains, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
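To make the evaluation side concrete, here is a minimal sketch of sentence-level BLEU-4: clipped n-gram precisions for n = 1..4, combined by a geometric mean and scaled by a brevity penalty. This is a simplified illustration in plain Python, not the corpus-level, smoothed implementation (e.g. sacreBLEU) used in actual benchmark evaluations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU over whitespace-tokenized strings."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        if not cand_counts:          # candidate shorter than n tokens
            return 0.0
        # Clip each candidate n-gram by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        if clipped == 0:             # no n-gram overlap at this order
            return 0.0
        log_prec_sum += math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty against the reference closest in length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```

CIDEr works on the same n-gram counts but weights them by TF-IDF across the reference corpus and averages cosine similarities, which is why it needs the full set of reference captions rather than a single pair.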

Papers

Showing 351–400 of 1878 papers

Title | Status | Hype
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | Code | 2
Using Machine Translation to Augment Multilingual Classification | – | 0
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model | Code | 0
A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection) | – | 0
Technical Report of NICE Challenge at CVPR 2024: Caption Re-ranking Evaluation Using Ensembled CLIP and Consensus Scores | Code | 0
Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis | – | 0
What Makes for Good Image Captions? | – | 0
Compressed Image Captioning using CNN-based Encoder-Decoder Framework | – | 0
Semi-supervised Text-based Person Search | – | 0
Learning text-to-video retrieval from image captioning | – | 0
OmniSearchSage: Multi-Task Multi-Entity Embeddings for Pinterest Search | Code | 2
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers | Code | 0
The Solution for the CVPR2024 NICE Image Captioning Challenge | – | 0
MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-Answering | – | 0
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? | Code | 1
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis | Code | 0
Bridging Vision and Language Spaces with Assignment Prediction | Code | 0
On Speculative Decoding for Multimodal Large Language Models | – | 0
FLoRA: Enhancing Vision-Language Models with Parameter-Efficient Federated Learning | Code | 0
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts | Code | 1
View Selection for 3D Captioning via Diffusion Ranking | Code | 3
Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation | – | 0
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | Code | 2
Would Deep Generative Models Amplify Bias in Future Models? | – | 0
Jump Self-attention: Capturing High-order Statistics in Transformers | – | 0
Harnessing the Power of Large Vision Language Models for Synthetic Image Detection | Code | 1
Disentangled Pre-training for Human-Object Interaction Detection | Code | 1
Bi-LORA: A Vision-Language Approach for Synthetic Image Detection | Code | 1
VLRM: Vision-Language Models act as Reward Models for Image Captioning | – | 0
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning | Code | 0
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction | – | 0
VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis | Code | 2
A Review of Multi-Modal Large Language and Vision Models | – | 0
LocCa: Visual Pretraining with Location-aware Captioners | Code | 0
Text Data-Centric Image Captioning with Interactive Prompts | – | 0
Semantic Map-based Generation of Navigation Instructions | Code | 0
A Survey on Large Language Models from Concept to Implementation | – | 0
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction | Code | 2
Automated Report Generation for Lung Cytological Images Using a CNN Vision Classifier and Multiple-Transformer Text Decoders: Preliminary Study | – | 0
Semi-Supervised Image Captioning Considering Wasserstein Graph Matching | – | 0
Visual Hallucination: Definition, Quantification, and Prescriptive Remediations | – | 0
The Solution for the ICCV 2023 1st Scientific Figure Captioning Challenge | – | 0
Image Captioning in news report scenario | – | 0
Cognitive resilience: Unraveling the proficiency of image-captioning models to interpret masked visual content | Code | 0
A Multimodal Approach for Cross-Domain Image Retrieval | – | 0
MyVLM: Personalizing VLMs for User-Specific Queries | – | 0
Inserting Faces inside Captions: Image Captioning with Attention Guided Merging | – | 0
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | – | 0
VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning | Code | 2
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition | – | 0
Page 8 of 38

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | IBM Research AI | CIDEr | 80.67 | – | Unverified
2 | CASIA_IVA | CIDEr | 79.15 | – | Unverified
3 | feixiang | CIDEr | 77.31 | – | Unverified
4 | wocao | CIDEr | 77.21 | – | Unverified
5 | lamiwab172 | CIDEr | 75.93 | – | Unverified
6 | RUC_AIM3 | CIDEr | 73.52 | – | Unverified
7 | funas | CIDEr | 73.51 | – | Unverified
8 | SRC-B_VCLab | CIDEr | 73.47 | – | Unverified
9 | sparta | CIDEr | 73.41 | – | Unverified
10 | x-viz | CIDEr | 73.26 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | VALOR | CIDEr | 152.5 | – | Unverified
2 | VAST | CIDEr | 149 | – | Unverified
3 | Virtex (ResNet-101) | CIDEr | 94 | – | Unverified
4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | – | Unverified
5 | BLIP-FuseCap | CLIPScore | 78.5 | – | Unverified
6 | mPLUG | BLEU-4 | 46.5 | – | Unverified
7 | OFA | BLEU-4 | 44.9 | – | Unverified
8 | GIT | BLEU-4 | 44.1 | – | Unverified
9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | – | Unverified
10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 149.1 | – | Unverified
2 | GIT2, Single Model | CIDEr | 124.18 | – | Unverified
3 | GIT, Single Model | CIDEr | 122.4 | – | Unverified
4 | PaLI | CIDEr | 121.09 | – | Unverified
5 | CoCa - Google Brain | CIDEr | 117.9 | – | Unverified
6 | Microsoft Cognitive Services team | CIDEr | 112.82 | – | Unverified
7 | Single Model | CIDEr | 108.98 | – | Unverified
8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | – | Unverified
9 | FudanFVL | CIDEr | 104.9 | – | Unverified
10 | FudanWYZ | CIDEr | 104.25 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | GIT2, Single Model | CIDEr | 125.51 | – | Unverified
2 | PaLI | CIDEr | 124.35 | – | Unverified
3 | GIT, Single Model | CIDEr | 123.92 | – | Unverified
4 | CoCa - Google Brain | CIDEr | 120.73 | – | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 115.54 | – | Unverified
6 | Single Model | CIDEr | 110.76 | – | Unverified
7 | FudanFVL | CIDEr | 109.33 | – | Unverified
8 | FudanWYZ | CIDEr | 108.04 | – | Unverified
9 | IEDA-LAB | CIDEr | 100.15 | – | Unverified
10 | firethehole | CIDEr | 99.51 | – | Unverified
# | Model | Metric | Claimed | Verified | Status
1 | PaLI | CIDEr | 126.67 | – | Unverified
2 | GIT2, Single Model | CIDEr | 122.27 | – | Unverified
3 | GIT, Single Model | CIDEr | 122.04 | – | Unverified
4 | CoCa - Google Brain | CIDEr | 121.69 | – | Unverified
5 | Microsoft Cognitive Services team | CIDEr | 110.14 | – | Unverified
6 | Single Model | CIDEr | 109.49 | – | Unverified
7 | FudanFVL | CIDEr | 106.55 | – | Unverified
8 | FudanWYZ | CIDEr | 103.75 | – | Unverified
9 | Human | CIDEr | 91.62 | – | Unverified
10 | firethehole | CIDEr | 88.54 | – | Unverified