Biomedical Named Entity Recognition at Scale

2020-11-12

Veysel Kocaman, David Talby

Code

github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.4.Biomedical_NER_SparkNLP_paper_reproduce.ipynb
Official

Abstract

Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.

Tasks

De-identification, Entity Resolution, GPU, Information Retrieval, Medical Named Entity Recognition, Named Entity Recognition (NER), Question Answering, Relation Extraction, Retrieval

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AnatEM	BLSTM-CNN-Char (SparkNLP)	F1	89.13	—	Unverified
BC2GM	Spark NLP	F1	88.75	—	Unverified
BC4CHEMD	BLSTM-CNN-Char (SparkNLP)	F1	93.72	—	Unverified
BC5CDR	Spark NLP	F1	89.73	—	Unverified
BC5CDR	BLSTM-CNN-Char (SparkNLP)	F1	89.73	—	Unverified
BC5CDR-chemical	Spark NLP	F1	94.88	—	Unverified
BioNLP13-CG	BLSTM-CNN-Char (SparkNLP)	F1	85.58	—	Unverified
JNLPBA	Spark NLP	F1	81.29	—	Unverified
JNLPBA	BLSTM-CNN-Char (SparkNLP)	F1	81.29	—	Unverified
LINNAEUS	BLSTM-CNN-Char (SparkNLP)	F1	86.26	—	Unverified
LINNAEUS	Spark NLP	F1	86.26	—	Unverified
NCBI Disease	Spark NLP	F1	89.13	—	Unverified
NCBI Disease	BLSTM-CNN-Char (SparkNLP)	F1	89.13	—	Unverified
Species-800	Spark NLP	F1	80.91	—	Unverified
Species800	BLSTM-CNN-Char (SparkNLP)	F1	80.91	—	Unverified
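
The scores above are F1 values. The listing does not say which F1 variant was used, so the following is only a minimal sketch under an assumption: entity-level, exact-match, micro-averaged F1 over BIO-tagged sentences, which is a common convention for these benchmarks. The helper names `bio_to_spans` and `entity_f1` and the example tags are hypothetical and are not taken from the paper or from Spark NLP.

```python
from typing import List, Set, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Assumption: a stray I- tag with no matching B- opens a new span.
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on exact type and boundary match."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: one sentence, two gold entities, one of them predicted.
gold = [["B-Chemical", "I-Chemical", "O", "B-Disease"]]
pred = [["B-Chemical", "I-Chemical", "O", "O"]]
print(round(100 * entity_f1(gold, pred), 2))
```

Under this convention a prediction with the right type but a wrong boundary earns no credit, so token-level accuracy can be much higher than the reported F1. Multiplying by 100 gives the percentage scale used in the table.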
