Image Captioning

Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: the input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
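The description above names BLEU as one of the usual evaluation metrics. As a rough illustration of what such a score measures, here is a minimal sentence-level sketch in Python. It assumes pre-tokenized captions, uniform weights over 1- to 4-grams, and no smoothing; the function names (`ngrams`, `modified_precision`, `sentence_bleu`) are hypothetical and chosen for this example. Benchmark numbers are normally computed at corpus level with the official evaluation tooling, so this sketch should not be expected to reproduce leaderboard values.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference where it is most frequent.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU (assumption: uniform weights, no smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, one empty n-gram order zeroes the whole score.
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (ties: shorter one).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    refs = ["a man is riding a horse on the beach".split(),
            "a person rides a horse along the shore".split()]
    hyp = "a man is riding a horse on the sand".split()
    print(round(sentence_bleu(hyp, refs), 3))
```

Clipping is what keeps a degenerate caption that repeats one common word from scoring well, and the brevity penalty discourages very short captions. CIDEr takes a different approach, weighting n-grams by TF-IDF over the reference set, and is not covered by this sketch.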

Papers

Showing 951–1000 of 1878 papers

Title	Date	Tasks	Status	Hype
Deep Learning Approaches on Image Captioning: A Review	Jan 31, 2022	Caption Generation, Deep Learning	Unverified	0
A Frustratingly Simple Approach for End-to-End Image Captioning	Jan 30, 2022	Decoder, Image Captioning	Unverified	0
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	Jan 28, 2022	Image Captioning, Image-text matching	Code Available	5
An Integrated Approach for Video Captioning and Applications	Jan 23, 2022	Image Captioning, Video Captioning	Unverified	0
Visual Information Guided Zero-Shot Paraphrase Generation	Jan 22, 2022	Diversity, Image Captioning	Code Available	0
Discovering Non-Monotonic Autoregressive Ordering for Text Generation Models using Sinkhorn Distributions	Jan 17, 2022	Code Generation, Decoder	Unverified	0
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset	Jan 16, 2022	Image Captioning, Model Selection	Unverified	0
Transparent Human Evaluation for Image Captioning	Jan 16, 2022	Image Captioning	Unverified	0
All You May Need for VQA are Image Captions	Jan 16, 2022	All, Image Captioning	Unverified	0
Long-Tail Classification for Distinctive Image Captioning: A Simple yet Effective Remedy for Side Effects of Reinforcement Learning	Jan 16, 2022	Image Captioning, Reinforcement Learning (RL)	Unverified	0
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand	Jan 16, 2022	Image Captioning, Machine Translation	Unverified	0
Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training	Jan 11, 2022	Decoder, Image Captioning	Unverified	0
Repurposing Existing Deep Networks for Caption and Aesthetic-Guided Image Cropping	Jan 7, 2022	Image Captioning, Image Cropping	Unverified	0
Compact Bidirectional Transformer for Image Captioning	Jan 6, 2022	Decoder, Image Captioning	Code Available	1
Synthesizer Based Efficient Self-Attention for Vision Tasks	Jan 5, 2022	Image Captioning, image-classification	Unverified	0
Interactive Attention AI to translate low light photos to captions for night scene understanding in women safety	Jan 4, 2022	Decoder, Deep Learning	Unverified	0
StyleM: Stylized Metrics for Image Captioning Built with Contrastive N-grams	Jan 4, 2022	Image Captioning	Unverified	0
DIFNet: Boosting Visual Information Flow for Image Captioning	Jan 1, 2022	Image Captioning, Prediction	Unverified	0
DeeCap: Dynamic Early Exiting for Efficient Image Captioning	Jan 1, 2022	Image Captioning, Imitation Learning	Code Available	1
Show, Deconfound and Tell: Image Captioning With Causal Inference	Jan 1, 2022	Causal Inference, Decoder	Code Available	1
ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation	Dec 31, 2021	Image Captioning, Image Generation	Code Available	1
Knowledge Matters: Radiology Report Generation with General and Specific Knowledge	Dec 30, 2021	Decoder, General Knowledge	Unverified	0
Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) -- Team: MMCUniAugsburg	Dec 28, 2021	Image Captioning	Unverified	0
Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation	Dec 28, 2021	Image Captioning, Machine Translation	Unverified	0
A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision	Dec 27, 2021	Classification, Image Captioning	Unverified	0
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks	Dec 13, 2021	Image Captioning, Transfer Learning	Code Available	1
MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning	Dec 13, 2021	Caption Generation, Descriptive	Unverified	0
Injecting Semantic Concepts into End-to-End Image Captioning	Dec 9, 2021	Caption Generation, Image Captioning	Code Available	1
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand	Dec 8, 2021	Image Captioning, Machine Translation	Code Available	1
Protecting Intellectual Property of Language Generation APIs with Lexical Watermark	Dec 5, 2021	Document Summarization, Image Captioning	Code Available	0
Consensus Graph Representation Learning for Better Grounded Image Captioning	Dec 2, 2021	Graph Representation Learning, Hallucination	Unverified	0
Object-Centric Unsupervised Image Captioning	Dec 2, 2021	Image Captioning, Object	Code Available	0
Image2tweet: Datasets in Hindi and English for Generating Tweets from Images	Dec 1, 2021	Image Captioning, World Knowledge	Code Available	0
A Scaled Encoder Decoder Network for Image Captioning in Hindi	Dec 1, 2021	Decoder, Deep Learning	Unverified	0
Image Caption Generation Framework for Assamese News using Attention Mechanism	Dec 1, 2021	Caption Generation, Decoder	Unverified	0
Set Prediction in the Latent Space	Dec 1, 2021	Image Captioning, object-detection	Code Available	0
Neural Attention for Image Captioning: Review of Outstanding Methods	Nov 29, 2021	Decoder, Deep Learning	Unverified	0
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic	Nov 29, 2021	Contrastive Learning, Descriptive	Code Available	1
Scene Graph Generation with Geometric Context	Nov 25, 2021	Activity Recognition, Graph Generation	Unverified	0
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets	Nov 24, 2021	Descriptive, Image Captioning	Unverified	0
Scaling Up Vision-Language Pre-training for Image Captioning	Nov 24, 2021	Attribute, Image Captioning	Unverified	0
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling	Nov 23, 2021	Image Captioning, Image Description	Code Available	1
L-Verse: Bidirectional Generation Between Image and Text	Nov 22, 2021	Image Captioning, Image Generation	Code Available	1
UFO: A UniFied TransfOrmer for Vision-Language Representation Learning	Nov 19, 2021	Image Captioning, Image-text matching	Unverified	0
ClipCap: CLIP Prefix for Image Captioning	Nov 18, 2021	Image Captioning, Language Modeling	Code Available	2
Transparent Human Evaluation for Image Captioning	Nov 17, 2021	Image Captioning	Code Available	1
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation	Nov 16, 2021	Image Captioning, Knowledge Distillation	Unverified	0
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-Modal Knowledge Transfer	Nov 16, 2021	Image Captioning, Language Modeling	Unverified	0
On Vision Features in Multimodal Machine Translation	Nov 16, 2021	Image Captioning, Machine Translation	Unverified	0
Temporal Knowledge-Aware Image Captioning	Nov 16, 2021	Caption Generation, Image Captioning	Unverified	0

Datasets: VizWiz 2020 test-dev, COCO Captions, nocaps in-domain, nocaps near-domain, nocaps out-of-domain, nocaps entire, COCO (Common Objects in Context), VizWiz 2020 test, nocaps-XD entire, nocaps-val-in-domain, nocaps-val-overall, nocaps-XD in-domain

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	IBM Research AI	CIDEr	80.67	—	Unverified
2	CASIA_IVA	CIDEr	79.15	—	Unverified
3	feixiang	CIDEr	77.31	—	Unverified
4	wocao	CIDEr	77.21	—	Unverified
5	lamiwab172	CIDEr	75.93	—	Unverified
6	RUC_AIM3	CIDEr	73.52	—	Unverified
7	funas	CIDEr	73.51	—	Unverified
8	SRC-B_VCLab	CIDEr	73.47	—	Unverified
9	sparta	CIDEr	73.41	—	Unverified
10	x-viz	CIDEr	73.26	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	VALOR	CIDEr	152.5	—	Unverified
2	VAST	CIDEr	149	—	Unverified
3	Virtex (ResNet-101)	CIDEr	94	—	Unverified
4	KOSMOS-1 (1.6B) (zero-shot)	CIDEr	84.7	—	Unverified
5	BLIP-FuseCap	CLIPScore	78.5	—	Unverified
6	mPLUG	BLEU-4	46.5	—	Unverified
7	OFA	BLEU-4	44.9	—	Unverified
8	GIT	BLEU-4	44.1	—	Unverified
9	BLIP-2 ViT-G OPT 2.7B (zero-shot)	BLEU-4	43.7	—	Unverified
10	BLIP-2 ViT-G OPT 6.7B (zero-shot)	BLEU-4	43.5	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	149.1	—	Unverified
2	GIT2, Single Model	CIDEr	124.18	—	Unverified
3	GIT, Single Model	CIDEr	122.4	—	Unverified
4	PaLI	CIDEr	121.09	—	Unverified
5	CoCa - Google Brain	CIDEr	117.9	—	Unverified
6	Microsoft Cognitive Services team	CIDEr	112.82	—	Unverified
7	Single Model	CIDEr	108.98	—	Unverified
8	GRIT (zero-shot, no VL pretraining, no CBS)	CIDEr	105.9	—	Unverified
9	FudanFVL	CIDEr	104.9	—	Unverified
10	FudanWYZ	CIDEr	104.25	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	GIT2, Single Model	CIDEr	125.51	—	Unverified
2	PaLI	CIDEr	124.35	—	Unverified
3	GIT, Single Model	CIDEr	123.92	—	Unverified
4	CoCa - Google Brain	CIDEr	120.73	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	115.54	—	Unverified
6	Single Model	CIDEr	110.76	—	Unverified
7	FudanFVL	CIDEr	109.33	—	Unverified
8	FudanWYZ	CIDEr	108.04	—	Unverified
9	IEDA-LAB	CIDEr	100.15	—	Unverified
10	firethehole	CIDEr	99.51	—	Unverified

#	Model	Metric	Claimed	Verified	Status
1	PaLI	CIDEr	126.67	—	Unverified
2	GIT2, Single Model	CIDEr	122.27	—	Unverified
3	GIT, Single Model	CIDEr	122.04	—	Unverified
4	CoCa - Google Brain	CIDEr	121.69	—	Unverified
5	Microsoft Cognitive Services team	CIDEr	110.14	—	Unverified
6	Single Model	CIDEr	109.49	—	Unverified
7	FudanFVL	CIDEr	106.55	—	Unverified
8	FudanWYZ	CIDEr	103.75	—	Unverified
9	Human	CIDEr	91.62	—	Unverified
10	firethehole	CIDEr	88.54	—	Unverified