CanVEC - the Canberra Vietnamese-English Code-switching Natural Speech Corpus

2020-05-01 · LREC 2020

Li Nguyen, Christopher Bryant

Abstract

This paper introduces the Canberra Vietnamese-English Code-switching corpus (CanVEC), an original corpus of natural mixed speech that we semi-automatically annotated with language information, part of speech (POS) tags and Vietnamese translations. The corpus, which was built to inform a sociolinguistic study on language variation and code-switching, consists of 10 hours of recorded speech (87k tokens) between 45 Vietnamese-English bilinguals living in Canberra, Australia. We describe how we collected and annotated the corpus by pipelining several monolingual toolkits to considerably speed up the annotation process. We also describe how we evaluated the automatic annotations to ensure corpus reliability. We make the corpus available for research purposes.
