Automatic Detection and Language Identification of Multilingual Documents

2014-01-01, TACL 2014

Marco Lui, Jey Han Lau, Timothy Baldwin

Abstract

Language identification is the task of automatically detecting the language(s) present in a document based on the content of the document. In this work, we address the problem of detecting documents that contain text from more than one language (multilingual documents). We introduce a method that is able to detect that a document is multilingual, identify the languages present, and estimate their relative proportions. We demonstrate the effectiveness of our method over synthetic data, as well as real-world multilingual documents collected from the web.
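
To make the three outputs named in the abstract concrete (a multilingual flag, the set of languages present, and their relative proportions), here is a minimal toy sketch. It is not the method of the paper: it simply counts hits against tiny hand-picked function-word lists. The names `STOPWORDS`, `language_shares`, `detect_languages`, and the `min_share` threshold are all hypothetical and chosen for illustration only.

```python
from collections import Counter

# Hypothetical, hand-picked function words per language (illustration only).
# A real system would learn its features from training data.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
    "de": {"der", "und", "ist", "das", "nicht", "ein"},
}


def language_shares(text):
    """Return each language's share of all function-word hits in the text."""
    counts = Counter()
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()")
        for lang, words in STOPWORDS.items():
            if word in words:
                counts[lang] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # no evidence for any known language
    return {lang: n / total for lang, n in counts.items()}


def detect_languages(text, min_share=0.2):
    """List languages whose share reaches the (assumed) min_share threshold."""
    shares = language_shares(text)
    return sorted(lang for lang, share in shares.items() if share >= min_share)


def is_multilingual(text, min_share=0.2):
    """A document counts as multilingual if more than one language is detected."""
    return len(detect_languages(text, min_share)) > 1
```

Calling `language_shares` on a mixed English and French sentence yields a share for each language, and `is_multilingual` returns `True` when more than one language clears the threshold. The share of function-word hits is only a rough proxy for the proportion of text in each language.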
