
Zero-shot Cross Language Text Classification

ICLR 2018 · 2018-01-01

Dan Svenstrup, Jonas Meinertz Hansen, Ole Winther

Abstract

Labeled text classification datasets are typically available in only a few languages. To train a model for a task such as news categorization in a target language L_t that lacks a suitable labeled dataset, there are two options: create a new labeled dataset by hand, or transfer label information from an existing labeled dataset in a source language L_s to L_t. In this paper we propose a method for sharing label information across languages by means of a language-independent text encoder. The encoder gives almost identical representations to versions of the same text in different languages, so labeled data in one language can be used to train a classifier that works for the other languages. The encoder is trained independently of any concrete classification task and can therefore be reused for any classification task. We show that it is possible to obtain good performance even when only a comparable corpus of texts is available.
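
The zero-shot transfer idea in the abstract can be illustrated with a toy sketch. It is not the paper's model: the learned encoder is replaced here by a hypothetical hand-written lexicon (`LEXICON`) that maps English and Danish words to shared concept ids, and the classifier is a simple nearest-centroid rule. All names, the tiny vocabulary, and the example sentences are assumptions made for illustration only. The point is just that once two languages share one representation space, a classifier fitted on source-language labels can be applied to target-language text without any target-language labels.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for a learned language-independent
# encoder: English and Danish words map to the same shared concept id.
LEXICON = {
    "football": "SPORT", "fodbold": "SPORT",
    "match": "GAME", "kamp": "GAME",
    "election": "VOTE", "valg": "VOTE",
    "government": "GOV", "regering": "GOV",
}


def encode(text):
    """Map a text in either language to a bag of shared concept ids."""
    return Counter(LEXICON[w] for w in text.lower().split() if w in LEXICON)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Sum the encodings of each class to get one centroid per label."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(encode(text))
    return centroids


def classify(text, centroids):
    """Return the label whose centroid is most similar to the encoded text."""
    vec = encode(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Labeled data exists only in the source language (English here).
english_train = [
    ("football match tonight", "sports"),
    ("election government debate", "politics"),
]
centroids = train(english_train)

# Target-language (Danish) text is classified with no Danish labels.
print(classify("fodbold kamp i aften", centroids))
print(classify("valg og regering", centroids))
```

In this sketch the encoder is built without reference to the sports/politics labels, mirroring the abstract's claim that the encoder is task-independent; only `train` sees labels, and only in the source language.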
