Sampling Bias in Deep Active Classification: An Empirical Study

2019-09-20 · IJCNLP 2019

Ameya Prabhu, Charles Dognin, Maneesh Singh


Code

github.com/drimpossible/Sampling-Bias-Active-Learning
Official · In paper
github.com/Xtra-Computing/thundersvm
Official · In paper

Abstract

The exploding cost and time needed for data labeling and model training are bottlenecks for training DNN models on large datasets. Identifying smaller representative data samples with strategies like active learning can help mitigate such bottlenecks. Previous works on active learning in NLP identify the problem of sampling bias in the samples acquired by uncertainty-based querying and develop costly approaches to address it. Using a large empirical study, we demonstrate that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices (query size and strategies), contrary to what the traditional literature suggests. We also show that an FTZ-based query strategy produces sample sets similar to those from more sophisticated approaches (e.g., ensemble networks). Finally, we show the effectiveness of the selected samples by creating tiny high-quality datasets and using them for fast and cheap training of large models. Based on the above, we propose a simple baseline for deep active text classification that outperforms the state-of-the-art. We expect the presented work to be useful and informative for dataset compression and for problems involving active, semi-supervised or online learning scenarios. Code and models are available at: https://github.com/drimpossible/Sampling-Bias-Active-Learning
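The entropy-based selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `posterior_entropy` and `select_queries` are hypothetical, the probabilities are made-up illustrative values, and the sketch assumes a classifier has already produced a matrix of class posteriors for the unlabeled pool.

```python
import numpy as np


def posterior_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities.

    `probs` has shape (n_samples, n_classes); rows are assumed to sum to 1.
    `eps` guards against log(0) for classes with zero probability.
    """
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + eps), axis=1)


def select_queries(probs, query_size):
    """Return indices of the `query_size` most uncertain pool samples.

    Hypothetical helper: ranks samples by posterior entropy, highest first.
    """
    scores = posterior_entropy(probs)
    # Negating the scores turns the ascending argsort into a descending ranking.
    order = np.argsort(-scores, kind="stable")
    return order[:query_size]


# Illustrative posteriors for four unlabeled samples over three classes.
pool_probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
])

picked = select_queries(pool_probs, query_size=2)
print(picked)  # indices of the two highest-entropy samples
```

In a full active learning loop, the selected samples would be labeled, added to the training set, and the model retrained before the next query round. The query size is one of the algorithmic choices the abstract reports the method to be robust to.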

Tasks

Active Learning · Classification · General Classification · Text Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AG News	ULMFiT (Small data)	Error	6.3	—	Unverified
Amazon-2	ULMFiT (Small data)	Error	3.9	—	Unverified
Amazon-5	ULMFiT (Small data)	Error	35.9	—	Unverified
DBpedia	ULMFiT (Small data)	Error	0.8	—	Unverified
Sogou News	ULMFiT (Small data)	Accuracy	97	—	Unverified
Yahoo! Answers	ULMFiT (Small data)	Accuracy	74.3	—	Unverified
Yelp-2	ULMFiT (Small data)	Accuracy	97.1	—	Unverified
Yelp-5	ULMFiT (Small data)	Accuracy	67.6	—	Unverified
