CATT: Character-based Arabic Tashkeel Transformer

2024-07-03 · Code Available

Faris Alasmary, Orjuwan Zaafarani, Ahmad Ghannam

Abstract

Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we finetuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art performance in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community: https://github.com/abjadai/catt
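The headline metric throughout is the Diacritic Error Rate (DER). As a rough illustration, a minimal sketch of a DER computation over pre-aligned per-character diacritic labels (the exact alignment rules and which characters are counted, e.g. case-ending handling, vary between papers and are assumptions here):

```python
def der(reference: list[str], predicted: list[str]) -> float:
    """Return DER as a percentage over aligned diacritic label sequences."""
    assert len(reference) == len(predicted), "sequences must be aligned"
    # Count positions where the predicted diacritic differs from the reference.
    errors = sum(r != p for r, p in zip(reference, predicted))
    return 100.0 * errors / len(reference)

# Example with symbolic diacritic labels (hypothetical encoding):
ref  = ["FATHA", "KASRA", "NONE", "DAMMA"]
pred = ["FATHA", "FATHA", "NONE", "DAMMA"]
print(f"DER = {der(ref, pred):.2f}%")  # 1 mismatch out of 4 -> 25.00%
```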

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| CATT | CATT ED | DER (%) | 8.62 | | Unverified |
| CATT | CATT EO | DER (%) | 8.76 | | Unverified |
| CATT | GPT-4 | DER (%) | 9.52 | | Unverified |
| CATT | CBHG | DER (%) | 10.81 | | Unverified |
| CATT | Command R+ | DER (%) | 13.17 | | Unverified |
| CATT | Shakkala | DER (%) | 13.49 | | Unverified |
| CATT | Sakhr | DER (%) | 13.84 | | Unverified |
| CATT | Alkhalil | DER (%) | 14.23 | | Unverified |
| CATT | Multilevel Diacritizer | DER (%) | 16.48 | | Unverified |
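The abstract's comparisons are stated as relative DER reductions. A minimal sketch of that calculation, applied to the rounded table values above (the paper's exact figures are presumably derived from unrounded DERs, so results from rounded values may differ slightly):

```python
def relative_der_reduction(baseline_der: float, model_der: float) -> float:
    """Percent relative reduction of a model's DER versus a baseline's DER."""
    return 100.0 * (baseline_der - model_der) / baseline_der

# CATT ED (8.62) vs. GPT-4 (9.52) on the CATT benchmark, using rounded values:
print(f"{relative_der_reduction(9.52, 8.62):.2f}% relative DER reduction")
```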