Supervised Multimodal Bitransformers for Classifying Images and Text
2019-09-06
Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, Davide Testuggine
Code
- github.com/facebookresearch/mmbt (official, in paper; PyTorch, ★ 0)
- github.com/huggingface/transformers (PyTorch, ★ 158,292)
- github.com/ThilinaRajapakse/simpletransformers (PyTorch, ★ 4,235)
- github.com/IsaacRodgz/multimodal-transformers-movies (PyTorch, ★ 11)
- github.com/adriangrepo/mmbt_lightning (PyTorch, ★ 0)
- github.com/IsaacRodgz/mmbt_experiments (PyTorch, ★ 0)
Abstract
Self-supervised bidirectional transformer models such as BERT have led to dramatic improvements in a wide variety of textual classification tasks. The modern digital world is increasingly multimodal, however, and textual information is often accompanied by other modalities such as images. We introduce a supervised multimodal bitransformer model that fuses information from text and image encoders, and obtain state-of-the-art performance on various multimodal classification benchmark tasks, outperforming strong baselines, including on hard test sets specifically designed to measure multimodal performance.
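The core idea in the abstract — projecting image-encoder features into the text token-embedding space so a single bidirectional transformer can attend jointly over both modalities — can be sketched minimally in NumPy. All dimensions, weight matrices, and names below are illustrative stand-ins, not values or APIs from the paper; a real implementation would use pretrained text and image encoders and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16               # shared embedding dimension (illustrative)
IMG_DIM = 32         # image-encoder output dimension (illustrative)
N_TXT, N_IMG = 5, 3  # number of text tokens and pooled image regions

# Stand-ins for the outputs of a text embedder and an image encoder.
text_emb = rng.normal(size=(N_TXT, D))
img_feat = rng.normal(size=(N_IMG, IMG_DIM))

# A learned projection maps image features into the token-embedding space.
W_proj = rng.normal(size=(IMG_DIM, D)) / np.sqrt(IMG_DIM)
img_emb = img_feat @ W_proj

# Segment embeddings tell the transformer which modality each token is.
seg = np.vstack([np.tile(rng.normal(size=D), (N_IMG, 1)),
                 np.tile(rng.normal(size=D), (N_TXT, 1))])
tokens = np.vstack([img_emb, text_emb]) + seg  # (N_IMG + N_TXT, D)

def self_attention(x, Wq, Wk, Wv):
    """Single-head bidirectional self-attention: every position
    attends to every other, across both modalities."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)
print(fused.shape)  # one fused vector per multimodal token
```

In the actual model this fused sequence would pass through a stack of pretrained transformer layers, with a classification head on the first position; the sketch only shows the single step that makes the model multimodal.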
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| V-SNLI | MMBT | Accuracy | 90.5 | — | Unverified |