GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing
2022-10-19
Siyao Peng, Yang Janet Liu, Amir Zeldes
Code
- github.com/logan-siyao-peng/gcdt (official, PyTorch)
Abstract
A lack of large-scale human-annotated data has hampered hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report parsing experiments on this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained with cross-lingual training on Chinese and English using multilingual embeddings.