A Two-Step Approach for Automatic OCR Post-Correction

2020-12-01 · COLING (LaTeCHCLfL, CLFL, LaTeCH) 2020 · Code Available

Robin Schaefer, Clemens Neudecker

Abstract

The quality of Optical Character Recognition (OCR) is a key factor in the digitisation of historical documents. OCR errors are a major obstacle for downstream tasks and have hindered advances in the use of digitised documents. In this paper we present a two-step approach to automatic OCR post-correction. The first component detects erroneous sequences in a set of OCRed texts, while the second corrects the OCR errors in those sequences. We show that applying the detection model before correction reduces the character error rate (CER) compared to a simple one-step correction model, and also reduces the number of correct characters that are falsely changed.
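The abstract reports results as character error rate (CER) but does not define it. As a minimal sketch, assuming the common definition (character-level Levenshtein distance between OCR output and ground truth, divided by the ground-truth length), CER could be computed as follows. The function names `levenshtein` and `cer` are illustrative and are not taken from the authors' code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed CER definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    # One substituted character in a 7-character reference.
    print(cer("Tbe cat", "The cat"))
```

Under this definition, a post-correction step improves a text when the CER of its output against the ground truth is lower than the CER of the raw OCR against the same ground truth.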
