Subdialectal Differences in Sorani Kurdish

2016-12-01 · WS 2016

Shervin Malmasi

Abstract

In this study, we apply classification methods to detect subdialectal differences in Sorani Kurdish texts produced in different regions, namely Iran and Iraq. As Sorani is a low-resource language, no corpus including texts from different regions was readily available. To address this, we identified data sources that could be leveraged for this task and created a dataset of 200,000 sentences. Using surface features, we attempted to classify Sorani subdialects, showing that sentences from news sources in Iraq and Iran are distinguishable with 96% accuracy. This is the first preliminary study for a dialect that has not been widely studied in computational linguistics, and it provides evidence for the possible existence of distinct subdialects.
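
As a rough illustration of sentence-level classification from surface features, the sketch below trains a small classifier on character n-gram counts. It rests on several assumptions: the abstract does not say which surface features or which learning algorithm were used, so character n-grams and multinomial Naive Bayes are stand-ins chosen for brevity, and the names `char_ngrams` and `NgramNaiveBayes` are hypothetical. The training strings are invented Latin-alphabet placeholders, not Sorani text and not drawn from the paper's dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max from text."""
    padded = f" {text} "  # pad so word-boundary n-grams are captured
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts.

    Assumption: this is an illustrative stand-in, not the paper's model.
    """

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of sentences
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(char_ngrams(sentence))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, sentence):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus add-one-smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / n_docs)
            for gram in char_ngrams(sentence):
                score += math.log((self.counts[label][gram] + 1)
                                  / (self.totals[label] + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder data (NOT Sorani, NOT from the paper's corpus).
train_sentences = ["aaa bab aab", "abab aaa", "xyz zyx xxy", "zzy xyz"]
train_labels = ["region_A", "region_A", "region_B", "region_B"]

model = NgramNaiveBayes().fit(train_sentences, train_labels)
print(model.predict("aab aba"))
print(model.predict("xyx zzz"))
```

Character n-grams are a common choice for dialect identification because they need no tokenizer or morphological analyzer, which suits a low-resource setting. Measuring accuracy on real data would additionally require a held-out test split or cross-validation, which this sketch omits.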
