
Transformer Based Punctuation Restoration for Turkish

8th International Conference on Computer Science and Engineering (UBMK), 2023 · Published 2023-09-15

Uygar Kurt, Aykut Çayır

Abstract

Mobile devices and social media platforms, together with technologies such as automatic speech recognition (ASR), make communication faster than ever before. However, this speed introduces recurring mistakes in text-based communication, most notably grammatical errors and omitted punctuation. The punctuation restoration task originates in the automatic speech recognition domain, where identifying and restoring the correct positions of punctuation marks is a challenging problem. However, no dataset exists for training a punctuation restoration model for the Turkish language. This paper focuses on restoring punctuation in Turkish texts and introduces a new Turkish dataset for punctuation restoration. Three transformer models, BERT, ELECTRA, and ConvBERT, are fine-tuned and tested on the newly created dataset for three labels: PERIOD, COMMA, and QUESTION MARK. Because of the imbalanced class distribution, benchmark results are reported in terms of precision, recall, and F1 score. Although all three models show similar performance, ELECTRA achieves the best overall F1 score of 83.9%.
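The abstract frames punctuation restoration as token classification over three labels and, because of class imbalance, reports per-label precision, recall, and F1 rather than accuracy. A minimal sketch of that per-label metric computation is shown below; the label names follow the abstract, while the `O` (no punctuation) tag and the toy gold/predicted sequences are illustrative assumptions, not data from the paper.

```python
from collections import Counter

# Illustrative tag set: the three labels from the paper plus "O" for
# "no punctuation follows this token" (an assumed convention).
PUNCT_LABELS = ("PERIOD", "COMMA", "QUESTION_MARK")

def per_label_prf(gold, pred, labels=PUNCT_LABELS):
    """Compute precision, recall, and F1 for each punctuation label.

    gold, pred: equal-length sequences of per-token label strings.
    Returns {label: (precision, recall, f1)}.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p and g in labels:
            tp[g] += 1          # correct punctuation prediction
        else:
            if p in labels:
                fp[p] += 1      # predicted punctuation where it was wrong
            if g in labels:
                fn[g] += 1      # missed a gold punctuation mark
    scores = {}
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if (tp[lab] + fp[lab]) else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if (tp[lab] + fn[lab]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        scores[lab] = (prec, rec, f1)
    return scores

# Toy example (made up): six tokens, model misses one COMMA.
gold = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION_MARK"]
pred = ["O", "O",     "O", "PERIOD", "O", "QUESTION_MARK"]
print(per_label_prf(gold, pred))
```

With imbalanced labels (far more `O` tokens than punctuation marks), this per-label view exposes failures such as the missed COMMA above that a single accuracy figure would hide.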
