Image Captioning
Image captioning is the task of describing the content of an image in words. It lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework: an input image is encoded into an intermediate representation of its content, which is then decoded into a descriptive text sequence. The most popular benchmarks are nocaps and COCO, and models are typically evaluated with BLEU or CIDEr metrics.
(Image credit: Reflective Decoding Network for Image Captioning, ICCV'19)
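The encoder-decoder pipeline described above can be sketched end to end. Everything here (the toy vocabulary, the dimensions, the random weights, and the function names) is an illustrative assumption, not a trained model; real systems pair a pretrained vision backbone with a learned language decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (illustrative assumptions).
VOCAB = ["<start>", "<end>", "a", "dog", "cat", "runs", "sits"]
EMBED_DIM, FEAT_DIM = 8, 16

# "Encoder": collapse a grid of image features into one vector (mean pooling).
def encode(image_features):          # (H*W, FEAT_DIM) -> (FEAT_DIM,)
    return image_features.mean(axis=0)

# "Decoder": a toy step that scores the next token from the image vector and
# the previous token's embedding. Weights are random, i.e. untrained.
W_img = rng.normal(size=(FEAT_DIM, len(VOCAB)))
W_tok = rng.normal(size=(EMBED_DIM, len(VOCAB)))
E_tok = rng.normal(size=(len(VOCAB), EMBED_DIM))

def decode_greedy(img_vec, max_len=5):
    tokens, prev = [], VOCAB.index("<start>")
    for _ in range(max_len):
        logits = img_vec @ W_img + E_tok[prev] @ W_tok
        prev = int(np.argmax(logits))            # greedy: pick the top token
        if VOCAB[prev] == "<end>":
            break
        tokens.append(VOCAB[prev])
    return tokens

# Encode a fake 7x7 feature grid, then decode a caption token by token.
caption = decode_greedy(encode(rng.normal(size=(49, FEAT_DIM))))
print(" ".join(caption))
```

Trained systems replace the mean-pool encoder with a CNN or vision transformer, the toy decoder with an LSTM or transformer language model, and greedy decoding with beam search.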
Papers
1,878 papers are tracked for this task.
Results are reported across several dataset splits: VizWiz 2020 test-dev, VizWiz 2020 test, COCO Captions, COCO (Common Objects in Context), nocaps (in-domain, near-domain, out-of-domain, entire), nocaps-XD (in-domain, entire), and nocaps-val (in-domain, overall).
Benchmark Results
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | IBM Research AI | CIDEr | 80.67 | — | Unverified |
| 2 | CASIA_IVA | CIDEr | 79.15 | — | Unverified |
| 3 | feixiang | CIDEr | 77.31 | — | Unverified |
| 4 | wocao | CIDEr | 77.21 | — | Unverified |
| 5 | lamiwab172 | CIDEr | 75.93 | — | Unverified |
| 6 | RUC_AIM3 | CIDEr | 73.52 | — | Unverified |
| 7 | funas | CIDEr | 73.51 | — | Unverified |
| 8 | SRC-B_VCLab | CIDEr | 73.47 | — | Unverified |
| 9 | sparta | CIDEr | 73.41 | — | Unverified |
| 10 | x-viz | CIDEr | 73.26 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | VALOR | CIDEr | 152.5 | — | Unverified |
| 2 | VAST | CIDEr | 149 | — | Unverified |
| 3 | Virtex (ResNet-101) | CIDEr | 94 | — | Unverified |
| 4 | KOSMOS-1 (1.6B) (zero-shot) | CIDEr | 84.7 | — | Unverified |
| 5 | BLIP-FuseCap | CLIPScore | 78.5 | — | Unverified |
| 6 | mPLUG | BLEU-4 | 46.5 | — | Unverified |
| 7 | OFA | BLEU-4 | 44.9 | — | Unverified |
| 8 | GIT | BLEU-4 | 44.1 | — | Unverified |
| 9 | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | — | Unverified |
| 10 | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLI | CIDEr | 149.1 | — | Unverified |
| 2 | GIT2, Single Model | CIDEr | 124.18 | — | Unverified |
| 3 | GIT, Single Model | CIDEr | 122.4 | — | Unverified |
| 4 | PaLI | CIDEr | 121.09 | — | Unverified |
| 5 | CoCa - Google Brain | CIDEr | 117.9 | — | Unverified |
| 6 | Microsoft Cognitive Services team | CIDEr | 112.82 | — | Unverified |
| 7 | Single Model | CIDEr | 108.98 | — | Unverified |
| 8 | GRIT (zero-shot, no VL pretraining, no CBS) | CIDEr | 105.9 | — | Unverified |
| 9 | FudanFVL | CIDEr | 104.9 | — | Unverified |
| 10 | FudanWYZ | CIDEr | 104.25 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GIT2, Single Model | CIDEr | 125.51 | — | Unverified |
| 2 | PaLI | CIDEr | 124.35 | — | Unverified |
| 3 | GIT, Single Model | CIDEr | 123.92 | — | Unverified |
| 4 | CoCa - Google Brain | CIDEr | 120.73 | — | Unverified |
| 5 | Microsoft Cognitive Services team | CIDEr | 115.54 | — | Unverified |
| 6 | Single Model | CIDEr | 110.76 | — | Unverified |
| 7 | FudanFVL | CIDEr | 109.33 | — | Unverified |
| 8 | FudanWYZ | CIDEr | 108.04 | — | Unverified |
| 9 | IEDA-LAB | CIDEr | 100.15 | — | Unverified |
| 10 | firethehole | CIDEr | 99.51 | — | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | PaLI | CIDEr | 126.67 | — | Unverified |
| 2 | GIT2, Single Model | CIDEr | 122.27 | — | Unverified |
| 3 | GIT, Single Model | CIDEr | 122.04 | — | Unverified |
| 4 | CoCa - Google Brain | CIDEr | 121.69 | — | Unverified |
| 5 | Microsoft Cognitive Services team | CIDEr | 110.14 | — | Unverified |
| 6 | Single Model | CIDEr | 109.49 | — | Unverified |
| 7 | FudanFVL | CIDEr | 106.55 | — | Unverified |
| 8 | FudanWYZ | CIDEr | 103.75 | — | Unverified |
| 9 | Human | CIDEr | 91.62 | — | Unverified |
| 10 | firethehole | CIDEr | 88.54 | — | Unverified |
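The BLEU metric appearing in the tables above can be sketched at the sentence level. The function names here are my own, and the implementation is unsmoothed; the leaderboard numbers come from corpus-level, smoothed implementations such as the COCO caption evaluation toolkit.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-4 with brevity penalty (unsmoothed sketch)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())       # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                       # unsmoothed: any zero kills
        return 0.0
    # Brevity penalty discourages captions shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "a dog runs on the grass".split()
print(bleu(ref, ref))                    # identical sentences -> 1.0
print(bleu("a cat sits".split(), ref))   # no 2-gram overlap -> 0.0
```

CIDEr, the other metric in the tables, instead computes TF-IDF-weighted n-gram similarity against multiple reference captions, which down-weights n-grams that appear in many captions across the corpus.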