Feature Hashing for Language and Dialect Identification

2017-07-01 · ACL 2017

Shervin Malmasi, Mark Dras

Abstract

We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
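
As a rough illustration of the general hashing trick the abstract refers to, the sketch below maps character n-gram counts into a fixed-size vector, so the feature dimension no longer grows with the vocabulary. It is not the authors' implementation: the function names `char_ngrams` and `hash_features`, the use of CRC32 as the hash, the signed update, the n-gram order, and the bucket count are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=3):
    """Yield overlapping character n-grams from text (hypothetical helper)."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


def hash_features(text, n_buckets=1024, n=3):
    """Map character n-gram counts into a vector of length n_buckets.

    Assumption: CRC32 stands in for whatever hash function a real
    system would use; the paper does not specify one here.
    """
    vec = [0] * n_buckets
    for gram, count in Counter(char_ngrams(text, n)).items():
        h = zlib.crc32(gram.encode("utf-8"))  # unsigned 32-bit integer
        index = h % n_buckets                 # bucket for this n-gram
        # Use the top hash bit as a sign so colliding n-grams tend to
        # cancel instead of always accumulating in the same direction.
        sign = 1 if (h >> 31) == 0 else -1
        vec[index] += sign * count
    return vec


if __name__ == "__main__":
    vec = hash_features("feature hashing for language identification")
    print(len(vec), sum(1 for v in vec if v != 0))
```

The vector length is fixed by `n_buckets` regardless of how many distinct n-grams or languages appear, which is the source of the memory savings; a smaller `n_buckets` means more collisions, and the abstract's hash-size experiments explore that trade-off.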

Tasks

Dialect Identification, Dimensionality Reduction, Information Retrieval, Language Identification, Machine Translation, Text Categorization
