PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

2023-11-21

Panyut Sriwirote, Jalinee Thapiang, Vasan Timtong, Attapol T. Rutherford

Code

github.com/clicknext-ai/phayathaibert
Official, in paper, PyTorch

Abstract

While WangchanBERTa has become the de facto standard in transformer-based Thai language modeling, it still has shortcomings in regard to the understanding of foreign words, most notably English words, which are often borrowed without orthographic assimilation into Thai in many contexts. We identify the lack of foreign vocabulary in WangchanBERTa's tokenizer as the main source of these shortcomings. We then expand WangchanBERTa's vocabulary via vocabulary transfer from XLM-R's pretrained tokenizer and pretrain a new model using the expanded tokenizer, starting from WangchanBERTa's checkpoint, on a new dataset that is larger than the one used to train WangchanBERTa. Our results show that our new pretrained model, PhayaThaiBERT, outperforms WangchanBERTa in many downstream tasks and datasets.
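The vocabulary expansion step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The toy vocabularies, the helper names `expand_vocab` and `extend_embeddings`, and the choice to initialise new embedding rows near the mean of the existing rows are all assumptions made for illustration; the abstract does not specify how the new embeddings are initialised.

```python
import random


def expand_vocab(base_vocab, donor_tokens):
    """Return a copy of base_vocab with unseen donor tokens appended.

    Assumes base ids are contiguous (0..n-1), so each new token
    takes the next free id.
    """
    expanded = dict(base_vocab)
    for token in donor_tokens:
        if token not in expanded:
            expanded[token] = len(expanded)
    return expanded


def extend_embeddings(embeddings, new_size, seed=0):
    """Keep existing rows and append rows for the added tokens.

    Assumption: each new row is the mean of the existing rows plus
    small Gaussian noise. This is one common heuristic, not
    necessarily the one used in the paper.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    extended = [list(row) for row in embeddings]
    for _ in range(new_size - len(embeddings)):
        extended.append([m + rng.gauss(0.0, 0.02) for m in mean])
    return extended


# Hypothetical toy data standing in for the two tokenizers' vocabularies.
base_vocab = {"<s>": 0, "</s>": 1, "▁ภาษา": 2, "ไทย": 3}
donor_tokens = ["<s>", "▁language", "▁model", "ไทย"]
base_embeddings = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.3], [-0.2, 0.4]]

expanded_vocab = expand_vocab(base_vocab, donor_tokens)
expanded_embeddings = extend_embeddings(base_embeddings, len(expanded_vocab))
```

Existing tokens keep their ids and embedding rows, so the expanded model can start from the original checkpoint and continue pretraining. With the Hugging Face `transformers` library, the analogous steps would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.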

Tasks

Language Modeling
