CoNECo: A corpus for named entity recognition and normalization of protein complexes

2024-05-18bioRxiv 2024Unverified0· sign in to hype

Katerina Nastou, Mikaela Koutrouli, Sampo Pyysalo, Lars Juhl Jensen

Unverified — Be the first to reproduce this paper.

Abstract

Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1,621 documents with 2,052 entities, 1,976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F1-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.

Tasks

named-entity-recognition Named Entity Recognition Named Entity Recognition (NER)NER

CoNECo: A corpus for named entity recognition and normalization of protein complexes

Abstract

Tasks

Reproductions