SOTAVerified

Image Captioning

Image captioning is the task of describing the content of an image in words. The task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with the BLEU or CIDEr metric.

(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
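To make the evaluation step more concrete, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on. It is a minimal illustration only: it omits the brevity penalty and the geometric mean over n-gram orders that full BLEU uses, it is not CIDEr, and the function names `ngrams` and `modified_precision` are hypothetical helpers chosen for this example rather than part of any benchmark toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate caption.

    Each candidate n-gram is credited at most as many times as it
    appears in the single reference where it is most frequent.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is shorter than n tokens

    # Maximum count of each n-gram over all reference captions.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    candidate = "a dog runs on the grass".split()
    references = ["a dog is running on the grass".split()]
    print(modified_precision(candidate, references, 1))
    print(modified_precision(candidate, references, 2))
```

Clipping is what stops a degenerate caption that repeats one common word from scoring well: the repeated word is only credited up to the number of times it appears in a reference.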

Papers

Showing 1-10 of 1878 papers


Benchmark Results

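| # | Model | Metric | Claimed | Verified | Status |
|---|-------|--------|---------|----------|--------|
| 1 | GIT2 | CIDEr | 124.77 | | Unverified |
| 2 | GIT | CIDEr | 123.39 | | Unverified |
| 3 | Microsoft Cognitive Services team | CIDEr | 114.25 | | Unverified |
| 4 | VLAF2 | CIDEr | 102.39 | | Unverified |
| 5 | Microsoft Cognitive Services team | CIDEr | 100.12 | | Unverified |
| 6 | Human | CIDEr | 85.34 | | Unverified |
| 7 | icp2ssi1_coco_si_0.02_5_test | CIDEr | 85.3 | | Unverified |
| 8 | test_cbs2 | CIDEr | 85.02 | | Unverified |
| 9 | UpDown + ELMo + CBS | CIDEr | 73.09 | | Unverified |
| 10 | Neural Baby Talk + CBS | CIDEr | 61.48 | | Unverified |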