CATT: Character-based Arabic Tashkeel Transformer
Faris Alasmary, Orjuwan Zaafarani, Ahmad Ghannam
Code: github.com/abjadai/catt (PyTorch)
Abstract
Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by resolving ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, one encoder-only and one encoder-decoder, both initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art performance in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community: https://github.com/abjadai/catt.
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CATT | CATT ED | DER(%) | 8.62 | — | Unverified |
| CATT | CATT EO | DER(%) | 8.76 | — | Unverified |
| CATT | GPT-4 | DER(%) | 9.52 | — | Unverified |
| CATT | CBHG | DER(%) | 10.81 | — | Unverified |
| CATT | Command R+ | DER(%) | 13.17 | — | Unverified |
| CATT | Shakkala | DER(%) | 13.49 | — | Unverified |
| CATT | Sakhr | DER(%) | 13.84 | — | Unverified |
| CATT | Alkhalil | DER(%) | 14.23 | — | Unverified |
| CATT | Multilevel Diacritizer | DER(%) | 16.48 | — | Unverified |
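The DER figures in the table above count the fraction of characters whose predicted diacritic differs from the reference. The paper does not publish its scoring script here, so the following is a minimal sketch of one common way to compute DER, assuming the predicted and reference strings share the same undiacritized base text; the function names and the choice to count every base character (including undiacritized ones) are illustrative assumptions, and real evaluations may differ (e.g. excluding case endings or non-Arabic characters).

```python
# Sketch of a Diacritic Error Rate (DER) computation. Assumes prediction and
# reference differ only in diacritics, not in the underlying letters.

# Arabic diacritic marks: tanween (x3), fatha, damma, kasra, shadda, sukun.
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def split_diacritics(text):
    """Split text into (base_char, attached_diacritics) pairs."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS and pairs:
            base, diac = pairs[-1]
            pairs[-1] = (base, diac + ch)  # attach mark to preceding letter
        else:
            pairs.append((ch, ""))
    return pairs

def der(reference, prediction):
    """Fraction of base characters whose diacritics mismatch the reference."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base (undiacritized) text must match")
    errors = sum(r_diac != h_diac
                 for (_, r_diac), (_, h_diac) in zip(ref, hyp))
    return errors / len(ref) if ref else 0.0
```

For example, comparing a fully correct diacritization of a three-letter word against one with a single wrong vowel yields a DER of 1/3 under this definition.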