Biomedical Named Entity Recognition at Scale

2020-11-12 · Code Available

Veysel Kocaman, David Talby

Abstract

Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| AnatEM | BLSTM-CNN-Char (SparkNLP) | F1 | 89.13 | | Unverified |
| BC2GM | Spark NLP | F1 | 88.75 | | Unverified |
| BC4CHEMD | BLSTM-CNN-Char (SparkNLP) | F1 | 93.72 | | Unverified |
| BC5CDR | Spark NLP | F1 | 89.73 | | Unverified |
| BC5CDR | BLSTM-CNN-Char (SparkNLP) | F1 | 89.73 | | Unverified |
| BC5CDR-chemical | Spark NLP | F1 | 94.88 | | Unverified |
| BioNLP13-CG | BLSTM-CNN-Char (SparkNLP) | F1 | 85.58 | | Unverified |
| JNLPBA | Spark NLP | F1 | 81.29 | | Unverified |
| JNLPBA | BLSTM-CNN-Char (SparkNLP) | F1 | 81.29 | | Unverified |
| LINNAEUS | BLSTM-CNN-Char (SparkNLP) | F1 | 86.26 | | Unverified |
| LINNAEUS | Spark NLP | F1 | 86.26 | | Unverified |
| NCBI Disease | Spark NLP | F1 | 89.13 | | Unverified |
| NCBI Disease | BLSTM-CNN-Char (SparkNLP) | F1 | 89.13 | | Unverified |
| Species-800 | Spark NLP | F1 | 80.91 | | Unverified |
| Species800 | BLSTM-CNN-Char (SparkNLP) | F1 | 80.91 | | Unverified |
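
For readers who want to check an F1 figure like those listed above, the following is a minimal sketch of entity-level F1 over BIO-tagged sequences. It assumes exact-match scoring (an entity counts as correct only when its type and both boundaries match), which is a common convention for NER benchmarks but is not confirmed by this page. The function names `bio_to_spans` and `entity_f1` and the labels `Chem` and `Dis` are hypothetical and are not part of Spark NLP.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        # A stray "I-" with no open span is treated leniently as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match F1 over a list of tagged sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction finds one of the two gold entities.
gold = [["B-Chem", "I-Chem", "O", "B-Dis"]]
pred = [["B-Chem", "I-Chem", "O", "O"]]
score = entity_f1(gold, pred)
```

In this example precision is 1/1 and recall is 1/2. Multiplying the resulting score by 100 gives a value on the same percentage scale as the Claimed column.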