SOTAVerified

Image Captioning

Image Captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an encoder maps the input image to an intermediate representation of its content, and a decoder generates a descriptive text sequence from that representation. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with n-gram overlap metrics such as BLEU or CIDEr.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)

Papers

Showing 1–10 of 1878 papers


Benchmark Results

| #  | Model                              | Metric | Claimed | Verified | Status     |
|----|------------------------------------|--------|---------|----------|------------|
| 1  | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr  | 123.7   |          | Unverified |
| 2  | BLIP-2 ViT-G OPT 6.7B (zero-shot)  | CIDEr  | 123.7   |          | Unverified |
| 3  | BLIP-2 ViT-G OPT 2.7B (zero-shot)  | CIDEr  | 123.0   |          | Unverified |
| 4  | LEMON_large                        | CIDEr  | 116.9   |          | Unverified |
| 5  | BLIP_ViT-L                         | CIDEr  | 114.9   |          | Unverified |
| 6  | SimVLM                             | CIDEr  | 113.7   |          | Unverified |
| 7  | BLIP_CapFilt-L                     | CIDEr  | 111.8   |          | Unverified |
| 8  | LEMON_base                         | CIDEr  | 107.7   |          | Unverified |
| 9  | OmniVL                             | CIDEr  | 104.6   |          | Unverified |
| 10 | VinVL                              | CIDEr  | 103.1   |          | Unverified |
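The scores above are n-gram overlap metrics. As an illustration of how such metrics work, unigram BLEU (BLEU-1) can be sketched in a few lines of plain Python. This is a simplified sketch, not the official COCO evaluation code: full BLEU combines precisions over 1- to 4-grams, and CIDEr additionally applies TF-IDF weighting across the reference corpus.

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """BLEU-1: modified unigram precision times the brevity penalty."""
    cand = candidate.split()
    cand_counts = Counter(cand)
    # Clip each candidate word's count by its max count in any reference,
    # so repeating a word cannot inflate precision.
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r.split()) for r in references),
                  key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision
```

For example, `bleu1("a cat on a mat", ["a cat sits on the mat"])` gives a modified precision of 4/5 (the second "a" is clipped) scaled by a brevity penalty of `exp(-0.2)`, since the candidate is one word shorter than the reference.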