SciBERT: A Pretrained Language Model for Scientific Text

2019-03-26 · IJCNLP 2019

Iz Beltagy, Kyle Lo, Arman Cohan

Code

github.com/allenai/scibert (Official, In paper, pytorch, ★ 0)
github.com/tetsu9923/scireviewgen (pytorch, ★ 19)
github.com/georgetown-cset/ai-relevant-papers (pytorch, ★ 14)
github.com/hoangcuongnguyen2001/scibert-for-technique-classification (none, ★ 0)
github.com/kuldeep7688/BioMedicalBertNer (pytorch, ★ 0)
github.com/charles9n/bert-sklearn (pytorch, ★ 0)

Abstract

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

Tasks

Citation Intent Classification, Dependency Parsing, General Classification, Language Modeling, Medical Named Entity Recognition, model, Named Entity Recognition (NER), Participant Intervention Comparison Outcome Extraction, Relation Extraction, Sentence, Sentence Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
GENIA - LAS	SciBERT (Base Vocab)	F1	91.26	—	Unverified
GENIA - LAS	SciBERT (SciVocab)	F1	91.41	—	Unverified
GENIA - UAS	SciBERT (Base Vocab)	F1	92.32	—	Unverified
GENIA - UAS	SciBERT (SciVocab)	F1	92.46	—	Unverified
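
The gap between the two vocabularies in the table above is easier to read as a difference per dataset. The following is a minimal sketch that recomputes those differences from the claimed scores only. The names CLAIMED and scivocab_gain are hypothetical helpers introduced for illustration, and the numbers are the unverified figures listed above, not independently reproduced results.

```python
# Rows copied from the benchmark table above: (dataset, model, claimed score).
# These are the page's claimed, unverified values.
CLAIMED = [
    ("GENIA - LAS", "SciBERT (Base Vocab)", 91.26),
    ("GENIA - LAS", "SciBERT (SciVocab)", 91.41),
    ("GENIA - UAS", "SciBERT (Base Vocab)", 92.32),
    ("GENIA - UAS", "SciBERT (SciVocab)", 92.46),
]


def scivocab_gain(rows):
    """Return {dataset: SciVocab score minus Base Vocab score}, rounded to 2 places."""
    # Group the claimed scores by dataset, keyed by model name.
    by_dataset = {}
    for dataset, model, score in rows:
        by_dataset.setdefault(dataset, {})[model] = score
    # Subtract the Base Vocab score from the SciVocab score for each dataset.
    return {
        dataset: round(
            scores["SciBERT (SciVocab)"] - scores["SciBERT (Base Vocab)"], 2
        )
        for dataset, scores in by_dataset.items()
    }


if __name__ == "__main__":
    for dataset, gain in scivocab_gain(CLAIMED).items():
        print(f"{dataset}: +{gain:.2f}")
```

On the claimed numbers, the SciVocab variant is ahead by 0.15 points on GENIA - LAS and 0.14 points on GENIA - UAS.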
