SOTAVerified

LexiDB: Patterns \& Methods for Corpus Linguistic Database Management

2020-05-01LREC 2020Unverified0· sign in to hype

Matthew Coole, Paul Rayson, John Mariani

Unverified — Be the first to reproduce this paper.

Reproduce

Abstract

LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets

Tasks

Reproductions