Context-Aware Text Normalisation for Historical Dialects

2020-12-01 · COLING 2020

Maria Sukhareva

Abstract

Context-aware historical text normalisation is a severely under-researched area. To fill this gap, we propose a context-aware normalisation approach that relies on state-of-the-art methods in neural machine translation and transfer learning. Specifically, we propose a multidialect normaliser with context-aware reranking of the candidates. The reranker relies on a word-level n-gram language model that is applied to the five best normalisation candidates. The results are evaluated on historical multidialect datasets of German, Spanish, Portuguese and Slovene. We show that incorporating dialectal information into training improves accuracy on all the datasets. The context-aware reranking gives a further improvement over the baseline. For three out of six datasets, we reach a significantly higher accuracy than reported in previous studies. The other three results are comparable with the current state of the art. The code for the reranker is published as open source.
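
The reranking step described in the abstract can be illustrated with a minimal sketch. This is not the published reranker: the abstract does not state the n-gram order, the smoothing method, or the search procedure, so the bigram order, add-one smoothing, Viterbi search, and all function names below (`train_bigram_lm`, `bigram_logprob`, `rerank`) are assumptions made for illustration only. The sketch takes a list of normalisation candidates per historical token (for example the five best outputs of a normaliser) and picks the candidate sequence that a word-level language model scores highest.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenised, already normalised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent  # "<s>" marks the sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); the smoothing choice is an assumption."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def rerank(candidates_per_token, unigrams, bigrams):
    """Viterbi search over non-empty per-token candidate lists (e.g. 5-best)."""
    # best maps the last chosen candidate to (log-probability, path so far).
    best = {"<s>": (0.0, [])}
    for candidates in candidates_per_token:
        new_best = {}
        for cand in candidates:
            score, path = max(
                ((s + bigram_logprob(prev, cand, unigrams, bigrams), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy data, invented for this example.
corpus = [["the", "king", "said"], ["the", "king", "came"], ["a", "kind", "man"]]
unigrams, bigrams = train_bigram_lm(corpus)
# The normaliser's top candidate for the second token is "kind";
# the surrounding context makes "king" the better choice.
candidates = [["the"], ["kind", "king"], ["said"]]
print(rerank(candidates, unigrams, bigrams))  # prints ['the', 'king', 'said']
```

In this toy setting the language model overrides the normaliser's first choice because "the king said" is better supported by the training text than "the kind said". A real system would likely also combine the language model score with the normaliser's own candidate scores; how the paper does this is not stated in the abstract.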

Tasks

Language Modelling, Machine Translation, Reranking, Transfer Learning, Translation
