Multimodal Item Categorization Fully Based on Transformer

2021-08-01 · ACL (ECNLP) 2021

Lei Chen, Houwei Chou, Yandi Xia, Hirokazu Miyake

Abstract

The Transformer has proven to be a powerful feature extraction method and has gained widespread adoption in natural language processing (NLP). In this paper we propose a multimodal item categorization (MIC) system solely based on the Transformer for both text and image processing. On a multimodal product data set collected from a Japanese e-commerce giant, we tested a new image classification model based on the Transformer and investigated different ways of fusing bi-modal information. Our experimental results on real industry data showed that the Transformer-based image classifier has performance on par with ResNet-based classifiers and is four times faster to train. Furthermore, a cross-modal attention layer was found to be critical for the MIC system to achieve performance gains over text-only and image-only models.
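The cross-modal attention fusion that the abstract identifies as critical can be sketched as single-head scaled dot-product attention in which text tokens act as queries over image-patch keys and values. This is an illustrative reconstruction only, not the authors' implementation: the function name, projection matrices, and dimensions are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats, Wq, Wk, Wv):
    """Hypothetical cross-modal attention layer.

    text_feats:  (num_text_tokens, d_model)  -- queries come from text
    image_feats: (num_patches, d_model)      -- keys/values come from image
    Wq, Wk, Wv:  (d_model, d_head) learned projection matrices
    Returns image-conditioned text features of shape (num_text_tokens, d_head).
    """
    Q = text_feats @ Wq
    K = image_feats @ Wk
    V = image_feats @ Wv
    # Each text token attends over all image patches.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores, axis=-1)  # rows sum to 1
    return attn @ V

# Toy usage with illustrative dimensions.
rng = np.random.default_rng(0)
d_model, d_head = 8, 8
text = rng.standard_normal((4, d_model))    # 4 text tokens
image = rng.standard_normal((6, d_model))   # 6 image patches
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
fused = cross_modal_attention(text, image, Wq, Wk, Wv)
```

In a full MIC model, such fused features would feed a classification head; the paper's finding is that this kind of cross-modal layer is what lets the bi-modal system outperform text-only and image-only baselines.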
