Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
2024-07-04
Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera, Rubén Manrique
Code
- github.com/historicalink/LatamXIX (official)
Abstract
This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.