QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages

2016-05-01 · LREC 2016

Arantxa Otegi, Nora Aranberri, Antonio Branco, Jan Hajič, Martin Popel, Kiril Simov, Eneko Agirre, Petya Osenova, Rita Pereira, João Silva, Steven Neale

Abstract

This work presents parallel corpora automatically annotated with several NLP tools, covering lemmatization and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus in the IT domain. English is common to all parallel corpora, with translations in five languages: Basque, Bulgarian, Czech, Portuguese, and Spanish. We describe the annotated corpora and the tools used for annotation, and report annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

Tasks

Cross-Lingual Transfer, Entity Disambiguation, General Classification, LEMMA, Machine Translation, Named Entity Recognition (NER), Part-Of-Speech Tagging, Translation, Word Sense Disambiguation
