The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation
2020-12-01 · INLG (ACL) 2020
Chris van der Lee, Chris Emmery, Sander Wubben, Emiel Krahmer
Abstract
This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English) and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domains, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real-world NLG tasks.