Pansori: ASR Corpus Generation from Open Online Video Contents
2018-12-23
Yoona Choi, Bowon Lee
Code
- github.com/yc9701/pansori-tedxkr-corpus (official, in paper)
- github.com/freds0/kabooks (PyTorch)
- github.com/freds0/katube
Abstract
This paper introduces Pansori, a program for creating ASR (automatic speech recognition) corpora from online video content. It uses a cloud-based speech API, which makes it easy to create corpora in different languages. Using this program, we semi-automatically generated the Pansori-TEDxKR dataset from Korean TED conference talks with community-transcribed subtitles. It is the first high-quality Korean-language corpus that is freely available for independent research. Pansori is released as open-source software, and the generated corpus is released under a permissive public license for community use and participation.
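The abstract does not spell out how the semi-automatic generation works. As a rough illustration only, one plausible step is to keep a subtitle segment when a speech recognizer's output for that audio span agrees closely with the community subtitle. The sketch below assumes this design; the names `Segment`, `similarity`, `filter_segments`, the `recognize` callback, and the 0.8 threshold are all hypothetical and are not taken from the Pansori code. No real cloud speech API is called: the recognizer is passed in as a plain function.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Iterable, List


@dataclass
class Segment:
    """One subtitle cue (hypothetical structure): times in seconds plus its text."""
    start: float
    end: float
    text: str


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits, dropping spaces and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(reference: str, hypothesis: str) -> float:
    """Character-level similarity ratio in [0, 1] between two transcripts."""
    matcher = SequenceMatcher(None, normalize(reference), normalize(hypothesis))
    return matcher.ratio()


def filter_segments(
    segments: Iterable[Segment],
    recognize: Callable[[Segment], str],
    threshold: float = 0.8,  # assumed cutoff, not a value from the paper
) -> List[Segment]:
    """Keep segments whose subtitle text agrees with the recognizer's output."""
    return [s for s in segments if similarity(s.text, recognize(s)) >= threshold]


if __name__ == "__main__":
    cues = [
        Segment(0.0, 2.5, "Hello, world!"),
        Segment(2.5, 5.0, "Thank you all"),
    ]
    # Stand-in for a cloud speech API: canned hypotheses keyed by start time.
    canned = {0.0: "hello world", 2.5: "completely different"}
    kept = filter_segments(cues, lambda seg: canned[seg.start])
    print([seg.text for seg in kept])
```

In a real pipeline, `recognize` would send the audio between `start` and `end` to whichever speech service is in use, and rejected segments could be queued for manual review, which would account for the "semi-automatic" part.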