
An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation

2016-05-01 · LREC 2016

Peter Viszlay, Ján Staš, Tomáš Koctúr, Martin Lojka, Jozef Juhár


Abstract

In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. The scheme first relies on automatic transcription of the broadcast news (BN) data by our Slovak large vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired a further 900 hours of BN data, part of which we plan to annotate semi-automatically. We present a preliminary corpus evaluation that gives very promising results.
