Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets
Anonymous
Abstract
Established cross-document coreference resolution (CDCR) datasets contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations. In this paper, we qualitatively and quantitatively compare the annotation schemes of ECB+, a CDCR dataset with identity coreference relations, and NewsWCL50, a CDCR dataset with identity, bridging, and near-identity coreference relations. The analysis shows that the coreference chains of NewsWCL50 are more lexically diverse than those of ECB+, but that annotating NewsWCL50 leads to lower inter-coder reliability. We propose a phrasing diversity (PD) metric that, unlike previously proposed metrics, accounts for the diversity of full phrases. We discuss the different tasks that the two datasets create for CDCR models, i.e., lexical disambiguation and lexical diversity challenges, and propose a direction for further CDCR evaluation.