Diacritics Restoration using BERT with Analysis on Czech language
2021-05-24
Jakub Náplava, Milan Straka, Jana Straková
Code
- github.com/ufal/bert-diacritics-restoration (official, PyTorch)
Abstract
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but either plausible variants (19%), or the system corrections of erroneous data (25%). Finally, we categorize the real errors in detail. We release the code at https://github.com/ufal/bert-diacritics-restoration.
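To make the task concrete: a diacritics restoration system takes text with diacritics removed and predicts the original form. The sketch below shows one common way to produce such stripped input, using Unicode canonical decomposition from the Python standard library. It is an illustration only, not the preprocessing used in the paper; the function name `strip_diacritics` is hypothetical, and letters without a canonical decomposition (for example Polish `ł` or Vietnamese `đ`) pass through this approach unchanged.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Assumption: this is a generic approach, not the paper's own
    preprocessing. Characters such as 'ł' or 'đ' have no canonical
    decomposition and are therefore left as they are.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (caron, acute, ring above, ...).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


# Czech example: the restoration model would see the stripped form
# and be asked to recover the original.
original = "Příliš žluťoučký kůň"
print(strip_diacritics(original))
```

A restoration model is then trained to map the stripped string back to the original, so pairs of (stripped, original) text can be generated from any correctly diacritized corpus.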
Tasks
- Croatian Text Diacritization
- Czech Text Diacritization
- French Text Diacritization
- Hungarian Text Diacritization
- Irish Text Diacritization
- Latvian Text Diacritization
- Polish Text Diacritization
- Romanian Text Diacritization
- Slovak Text Diacritization
- Spanish Text Diacritization
- Turkish Text Diacritization
- Vietnamese Text Diacritization
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.71 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.66 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.62 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.41 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.32 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.22 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.95 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.88 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.64 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.63 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 98.53 | — | Unverified |
| Multilingual Dataset for Training and Evaluating Diacritics Restoration Systems | BERT | Alpha-Word accuracy | 99.73 | — | Unverified |
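
The table reports alpha-word accuracy. As a rough illustration of how such a metric might be computed, the sketch below assumes that it is the percentage of correctly restored words among words containing at least one alphabetic character, with whitespace tokenization. Both the definition as coded here and the function name `alpha_word_accuracy` are assumptions for illustration; this is not the authors' evaluation script.

```python
def alpha_word_accuracy(gold: str, predicted: str) -> float:
    """Percentage of correctly restored words, counting only gold words
    that contain at least one alphabetic character.

    Assumptions: whitespace tokenization, and the prediction has the
    same number of tokens as the gold text (restoration only changes
    characters, never word boundaries).
    """
    gold_words = gold.split()
    pred_words = predicted.split()
    if len(gold_words) != len(pred_words):
        raise ValueError("gold and predicted must have the same number of words")

    # Keep only positions where the gold word has an alphabetic character,
    # so numbers and punctuation tokens do not inflate the score.
    pairs = [
        (g, p)
        for g, p in zip(gold_words, pred_words)
        if any(ch.isalpha() for ch in g)
    ]
    if not pairs:
        raise ValueError("no alphabetic words to evaluate")

    correct = sum(1 for g, p in pairs if g == p)
    return 100.0 * correct / len(pairs)


# "3" and "." are ignored; one of the two alphabetic words is wrong.
print(alpha_word_accuracy("Máme 3 kočky .", "Mame 3 kočky ."))
```

Excluding non-alphabetic tokens matters because numbers and punctuation can never carry diacritics, so counting them would inflate the score.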