DuoSearch: A Novel Search Engine for Bulgarian Historical Documents
Angel Beshirov, Suzan Hadzhieva, Ivan Koychev, Milena Dobreva
Code: github.com/angelbeshirov/duosearch (official)
Abstract
Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present DuoSearch, a new search engine for historical documents that combines ElasticSearch with machine learning methods based on deep neural networks to address this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface that lets end users enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
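As a rough illustration of spelling-tolerant search on ElasticSearch, the sketch below builds a standard `match` query with the `fuzziness` option, which accepts terms within a small edit distance of the query term. This is a generic stand-in and not DuoSearch's documented method: the abstract states that the system relies on deep neural networks, and the function name `build_fuzzy_query` and the field name `content` are assumptions made only for this example.

```python
import json


def build_fuzzy_query(term, field="content", fuzziness="AUTO"):
    """Build an ElasticSearch request body for a fuzzy match query.

    `field` is a hypothetical index field holding the OCR text;
    it is not taken from the DuoSearch code base.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": term,
                    # "AUTO" lets ElasticSearch pick the allowed edit
                    # distance based on the length of the term.
                    "fuzziness": fuzziness,
                }
            }
        }
    }


if __name__ == "__main__":
    # Modern Bulgarian query term ("newspaper"); the body would be sent
    # to an index's _search endpoint by a client of your choice.
    body = build_fuzzy_query("вестник")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Edit-distance matching of this kind only covers small deviations, such as a single misrecognised character; systematic differences between historical and modern orthography would likely need the kind of learned normalisation the abstract describes.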