Diacritics Restoration using BERT with Analysis on Czech language

2021-05-24

Jakub Náplava, Milan Straka, Jana Straková

Abstract

We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are not actually errors, but are either plausible variants (19%) or system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
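To make the task concrete: a restoration system receives text with the diacritics removed and must produce the correctly diacritized text. The sketch below shows one way to derive such input/target pairs from diacritized text. It is a minimal illustration, not the authors' code; the names `strip_diacritics` and `make_pair` are hypothetical, and the approach (Unicode NFD decomposition followed by dropping combining marks) is an assumption that leaves letters without a canonical decomposition, such as Polish ł, unchanged.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Assumption: diacritics are representable as combining marks.
    Letters with no canonical decomposition are left untouched.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that any remaining sequences are in NFC form.
    return unicodedata.normalize("NFC", stripped)


def make_pair(diacritized: str) -> tuple[str, str]:
    """Return a (model input, target) pair for diacritics restoration."""
    return strip_diacritics(diacritized), diacritized


if __name__ == "__main__":
    source, target = make_pair("Příliš žluťoučký kůň")
    print(source)
    print(target)
```

Because stripping changes only individual letters, the input and target in this sketch always contain the same number of whitespace-separated words, which makes word-level comparison straightforward.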

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.71 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.66 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.62 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.41 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.32 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.22 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.95 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.88 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.64 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.63 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.53 | - | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.73 | - | Unverified |
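
The metric reported above is word-level accuracy. As a rough illustration of how such a score could be computed, the sketch below assumes that "alpha-word accuracy" means accuracy over words containing at least one alphabetic character, so that numbers and punctuation tokens are not counted. That definition is an assumption made for this example and is not stated in the listing; the function name `alpha_word_accuracy` is hypothetical.

```python
def alpha_word_accuracy(gold: str, predicted: str) -> float:
    """Fraction of alphabetic gold words reproduced exactly by the prediction.

    Assumption: a word is scored only if it contains at least one
    alphabetic character; tokens are whitespace-separated and aligned
    one-to-one between gold and prediction.
    """
    gold_words = gold.split()
    pred_words = predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted texts must have the same number of words")

    # Keep only the pairs whose gold word has an alphabetic character.
    scored = [
        (g, p)
        for g, p in zip(gold_words, pred_words)
        if any(ch.isalpha() for ch in g)
    ]
    if not scored:
        raise ValueError("no alphabetic words to score")

    correct = sum(g == p for g, p in scored)
    return correct / len(scored)


if __name__ == "__main__":
    gold = "Kůň má 3 nohy ."
    predicted = "Kun má 3 nohy ."
    # Three alphabetic words are scored; "3" and "." are skipped.
    print(f"{alpha_word_accuracy(gold, predicted):.4f}")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the claimed values in the table.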