
Transformer Based Punctuation Restoration for Turkish

8th International Conference on Computer Science and Engineering (UBMK), 2023 · Published 2023-09-15

Uygar Kurt, Aykut Çayır

Abstract

Mobile devices and social media platforms, together with technologies such as automatic speech recognition (ASR), make communication faster than ever before. However, this speed introduces recurring mistakes in text-based communication, most notably grammatical errors and omitted punctuation. The punctuation restoration task originates in the automatic speech recognition domain, where identifying and restoring the correct positions of punctuation marks is a challenging problem. However, no dataset exists for training a punctuation restoration model for the Turkish language. This paper focuses on restoring punctuation in Turkish texts and introduces a new Turkish dataset for punctuation restoration. Three transformer models, BERT, ELECTRA, and ConvBERT, are fine-tuned and tested on the newly created dataset for three labels: PERIOD, COMMA, and QUESTION MARK. Because of the imbalanced class distribution, benchmark results are reported in terms of precision, recall, and F1 score. Although all three models show similar performance, ELECTRA achieves the best overall F1 score of 83.9%.
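The abstract frames punctuation restoration as token classification over three labels and, because of class imbalance, reports per-label precision, recall, and F1 rather than accuracy. A minimal sketch of that per-label metric computation is shown below; the label names follow the abstract, while the `O` (no punctuation) tag and the toy gold/predicted sequences are illustrative assumptions, not data from the paper.

```python
from collections import Counter

# Illustrative tag set: the three labels from the paper plus "O" for
# "no punctuation follows this token" (an assumed convention).
PUNCT_LABELS = ("PERIOD", "COMMA", "QUESTION_MARK")

def per_label_prf(gold, pred, labels=PUNCT_LABELS):
    """Compute precision, recall, and F1 for each punctuation label.

    gold, pred: equal-length sequences of per-token label strings.
    Returns {label: (precision, recall, f1)}.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and g in labels:
            tp[g] += 1          # correct punctuation prediction
        else:
            if p in labels:
                fp[p] += 1      # predicted punctuation where it was wrong
            if g in labels:
                fn[g] += 1      # missed a gold punctuation mark
    scores = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if (tp[lab] + fp[lab]) else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if (tp[lab] + fn[lab]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        scores[lab] = (prec, rec, f1)
    return scores

# Toy example (made up): six tokens, model misses one COMMA.
gold = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION_MARK"]
pred = ["O", "O",     "O", "PERIOD", "O", "QUESTION_MARK"]
print(per_label_prf(gold, pred))
```

With imbalanced labels (far more `O` tokens than punctuation marks), this per-label view exposes failures such as the missed COMMA above that a single accuracy figure would hide.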
