Text Segmentation of Digitized Clinical Texts

2016-05-01 · LREC 2016

Cyril Grouin

Abstract

In this paper, we present our experiments on recovering the original two-column page layout from digitized files whose layout was damaged. We designed several CRF-based approaches, either to identify the column separator or to classify each token of each line as belonging to the left or right column. We achieved our best results with a model trained on homogeneous corpora (only files composed of 2 columns) when classifying each token into the left or right column (overall F-measure of 0.968). Our experiments show that the original column layout of digitized documents can be recovered with high-quality results.
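To make the token-classification setting concrete, here is a minimal sketch of how per-token layout features and a per-label F-measure might be computed. The abstract does not list the features used, so the feature names (`start`, `gap_before`, `rel_pos`), the whitespace tokenization, and the `L`/`R` label names are all assumptions chosen for illustration. In a real setup, dictionaries like these would be fed to a CRF toolkit.

```python
import re


def token_features(line):
    """Return one feature dict per whitespace-delimited token in `line`.

    The feature names are hypothetical; they only illustrate the kind of
    positional cues a sequence labeller could use to tell columns apart.
    """
    feats = []
    prev_end = 0
    width = max(len(line), 1)  # avoid division by zero on empty lines
    for m in re.finditer(r"\S+", line):
        feats.append({
            "token": m.group(),
            "start": m.start(),                   # character offset in the line
            "gap_before": m.start() - prev_end,   # whitespace run before the token
            "rel_pos": round(m.start() / width, 2),
        })
        prev_end = m.end()
    return feats


def f_measure(gold, pred, label):
    """Balanced F-measure (F1) for one label over aligned gold/predicted tags."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Invented example line: a wide gap hints at the column boundary.
    line = "Patient admitted" + " " * 5 + "Blood pressure"
    for feat in token_features(line):
        print(feat)
    # Gold vs. predicted column labels for the four tokens (invented data).
    print(f_measure(["L", "L", "R", "R"], ["L", "R", "R", "R"], "R"))
```

A wide `gap_before` value is one plausible cue for a column boundary, but a line whose left column is empty has no such gap, which is one reason a learned sequence model can be preferable to a fixed threshold.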
