SOTAVerified

Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

2020-10-02 · Findings of the Association for Computational Linguistics

Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cecile Paris

Status: Unverified.

Abstract

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications that use social media text, and its unique language variety, we pretrain two models on tweets and forum text, respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
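The abstract does not spell out which similarity measures the authors use; as a rough illustration of the idea, the sketch below ranks candidate pretraining corpora by the vocabulary overlap between each corpus and the downstream task data. The corpus names, toy data, and the top_vocab/vocab_overlap helpers are all hypothetical, not the authors' code.

```python
from collections import Counter

def top_vocab(texts, k=1000):
    """Return the set of the k most frequent whitespace-tokenised words in a corpus."""
    counts = Counter(tok.lower() for text in texts for tok in text.split())
    return {tok for tok, _ in counts.most_common(k)}

def vocab_overlap(target_texts, candidate_texts, k=1000):
    """Jaccard similarity between the top-k vocabularies of two corpora."""
    target = top_vocab(target_texts, k)
    candidate = top_vocab(candidate_texts, k)
    return len(target & candidate) / len(target | candidate)

# Toy corpora standing in for the real collections (hypothetical data).
task_texts = ["patient reports side effects after taking the new med"]
candidates = {
    "tweets": ["omg this new med gave me the worst side effects lol"],
    "forums": ["has anyone taken this medication and noticed side effects?"],
    "news": ["the central bank raised interest rates again on tuesday"],
}

# Rank candidate pretraining corpora by similarity to the downstream task data.
ranked = sorted(candidates,
                key=lambda name: vocab_overlap(task_texts, candidates[name]),
                reverse=True)
print(ranked)  # corpora whose vocabulary best matches the task come first
```

The intuition is that a corpus whose frequent vocabulary resembles the task data should transfer better than a lexically distant one, which is the kind of signal a cost-effective selection procedure can exploit before committing to expensive pretraining.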

Benchmark Results

Dataset      | Model        | Metric        | Claimed | Verified | Status
2010 i2b2/VA | ClinicalBERT | Exact Span F1 | 87.4    | -        | Unverified
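
If the released checkpoints are in (or have been converted to) the Hugging Face format, they can be loaded with the transformers library. A minimal sketch, assuming the archive behind the bit.ly link has been downloaded and unpacked locally; the directory name below is hypothetical:

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical local path to one of the released checkpoints; the actual
# directory layout depends on the archive behind https://bit.ly/35RpTf0.
model_dir = "./pretrained-tweet-bert"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModel.from_pretrained(model_dir)

inputs = tokenizer("omg this new med gave me the worst side effects",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```

From there the encoder can be fine-tuned on a downstream task such as the i2b2 span extraction benchmark listed above.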

Reproductions

No reproductions have been submitted yet.