DuoSearch: A Novel Search Engine for Bulgarian Historical Documents

2023-05-30 · Code available

Angel Beshirov, Suzan Hadzhieva, Ivan Koychev, Milena Dobreva

Abstract

Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present DuoSearch, a new search engine for historical documents that uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of Bulgarian historical newspapers from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface that lets end users enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
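To make the idea of searching historical spellings from a modern query concrete, the following is a minimal sketch and not the DuoSearch implementation. The abstract says the system relies on deep neural networks; this sketch swaps in a single hand-written rule (pre-reform Bulgarian spelling commonly added a final "ъ" to words ending in a hard consonant) purely for illustration. The function names and the field name `text` are hypothetical. The output is a standard Elasticsearch bool query whose fuzzy `match` clauses give some tolerance to OCR errors.

```python
import json

# Toy assumption: only the word-final "ъ" convention is modelled here.
# A real system would need far richer variant generation.
HARD_CONSONANTS = set("бвгджзклмнпрстфхцчшщ")


def spelling_variants(term: str) -> list[str]:
    """Return the modern term plus hypothetical historical variants."""
    variants = [term]
    if term and term[-1].lower() in HARD_CONSONANTS:
        variants.append(term + "ъ")
    return variants


def build_query(term: str, field: str = "text") -> dict:
    """Build an Elasticsearch query body matching any spelling variant.

    "fuzziness": "AUTO" allows a small edit distance per term, which
    helps with character-level OCR mistakes.
    """
    should = [
        {"match": {field: {"query": variant, "fuzziness": "AUTO"}}}
        for variant in spelling_variants(term)
    ]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    # Print the request body for the modern word "град" (city).
    print(json.dumps(build_query("град"), ensure_ascii=False, indent=2))
```

The resulting dictionary could be passed as the body of a search request to an index of OCR'd newspaper pages. Expanding the query on the search side, rather than normalising the documents, keeps the original text untouched.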
