The BDCamões Collection of Portuguese Literary Documents: a Research Resource for Digital Humanities and Language Technology
Sara Grilo, Márcia Bolrinha, João Silva, Rui Vaz, António Branco
Abstract
This paper presents the BDCamões Collection of Portuguese Literary Documents, a new corpus of literary texts written in Portuguese. Its inaugural version includes close to 4 million words from over 200 complete documents by 83 authors in 14 genres, covering a time span from the 16th to the 21st century and adhering to different orthographic conventions. Many of the texts in the corpus have also been automatically parsed with state-of-the-art language processing tools, forming the BDCamões Treebank subcorpus. This set of characteristics makes BDCamões an invaluable resource for research in language technology (e.g. authorship detection, genre classification) and in language science and digital humanities (e.g. comparative literature, diachronic linguistics).