SciBERT: A Pretrained Language Model for Scientific Text
Iz Beltagy, Kyle Lo, Arman Cohan
Code
- github.com/allenai/scibert — official (in paper), PyTorch, ★ 0
- github.com/tetsu9923/scireviewgen — PyTorch, ★ 19
- github.com/georgetown-cset/ai-relevant-papers — PyTorch, ★ 14
- github.com/hoangcuongnguyen2001/scibert-for-technique-classification — ★ 0
- github.com/kuldeep7688/BioMedicalBertNer — PyTorch, ★ 0
- github.com/charles9n/bert-sklearn — PyTorch, ★ 0
Abstract
Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.
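The abstract notes that the pretrained models are publicly released. A minimal sketch of loading SciBERT and extracting contextual embeddings with the Hugging Face Transformers library is shown below; the `allenai/scibert_scivocab_uncased` checkpoint name on the Hugging Face Hub is an assumption (the paper itself distributes weights via its GitHub repository), and the example sentence is illustrative only.

```python
# Sketch: load SciBERT (SciVocab, uncased) and get per-token embeddings.
# Assumes the "allenai/scibert_scivocab_uncased" Hub checkpoint mirrors
# the weights released with the paper.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "The glucocorticoid receptor mediates transcriptional activation."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# Last hidden states: one 768-dimensional contextual vector per wordpiece.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (1, num_wordpieces, 768)
```

For downstream tasks such as the sequence tagging and sentence classification evaluations in the paper, these embeddings would typically be fed to a task-specific head (or the model fine-tuned end to end).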
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| GENIA (dependency parsing) | SciBERT (BaseVocab) | LAS | 91.26 | — | Unverified |
| GENIA (dependency parsing) | SciBERT (SciVocab) | LAS | 91.41 | — | Unverified |
| GENIA (dependency parsing) | SciBERT (BaseVocab) | UAS | 92.32 | — | Unverified |
| GENIA (dependency parsing) | SciBERT (SciVocab) | UAS | 92.46 | — | Unverified |