GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing

2022-10-19

Siyao Peng, Yang Janet Liu, Amir Zeldes

Abstract

A lack of large-scale human-annotated data has hampered the hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report on parsing experiments with this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained using cross-lingual training in Chinese and English with multilingual embeddings.
