Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models

2019-07-01 · ACL 2019

Ruibo Wang, Jihong Li

Abstract

Direct comparison on point estimation of the precision (P), recall (R), and F1 measure of two natural language processing (NLP) models on a common test corpus is unreasonable and results in less replicable conclusions due to a lack of a statistical test. However, the existing t-tests in cross-validation (CV) for model comparison are inappropriate because the distributions of P, R, F1 are skewed and an interval estimation of P, R, and F1 based on a t-test may exceed [0,1]. In this study, we propose to use a block-regularized 3×2 CV (3×2 BCV) in model comparison because it could regularize the difference in certain frequency distributions over linguistic units between training and validation sets and yield stable estimators of P, R, and F1. On the basis of the 3×2 BCV, we calibrate the posterior distributions of P, R, and F1 and derive an accurate interval estimation of P, R, and F1. Furthermore, we formulate the comparison into a hypothesis testing problem and propose a novel Bayes test. The test could directly compute the probabilities of the hypotheses on the basis of the posterior distributions and provide more informative decisions than the existing significance t-tests. Three experiments with regard to NLP chunking tasks are conducted, and the results illustrate the validity of the Bayes test.
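To make the idea of computing hypothesis probabilities from posterior distributions concrete, here is a minimal Python sketch. It is not the authors' 3×2 BCV calibrated posterior: it assumes a single held-out evaluation per model, a Dirichlet posterior over the (TP, FP, FN) proportions with a uniform prior, and independence between the two models' posteriors. The function names `prf_posterior` and `prob_a_better` and all counts are hypothetical.

```python
import numpy as np


def prf_posterior(tp, fp, fn, n_samples=100_000, prior=1.0, seed=0):
    """Draw posterior samples of (P, R, F1).

    Assumption for illustration: the (TP, FP, FN) proportions follow a
    Dirichlet posterior with a symmetric prior, not the paper's
    3x2 BCV-based posterior.
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet([tp + prior, fp + prior, fn + prior], size=n_samples)
    p_tp, p_fp, p_fn = pi[:, 0], pi[:, 1], pi[:, 2]
    precision = p_tp / (p_tp + p_fp)
    recall = p_tp / (p_tp + p_fn)
    f1 = 2 * p_tp / (2 * p_tp + p_fp + p_fn)
    return precision, recall, f1


def prob_a_better(samples_a, samples_b):
    """Monte Carlo estimate of P(metric_A > metric_B | data)."""
    return float(np.mean(samples_a > samples_b))


# Hypothetical confusion counts for two chunking models.
_, _, f1_a = prf_posterior(tp=850, fp=100, fn=150, seed=1)
_, _, f1_b = prf_posterior(tp=800, fp=150, fn=200, seed=2)

# Posterior probability of the hypothesis "model A has the higher F1".
p_a_better = prob_a_better(f1_a, f1_b)

# 95% credible interval for model A's F1; it lies inside [0, 1] by construction.
lo, hi = np.quantile(f1_a, [0.025, 0.975])
print(f"P(F1_A > F1_B) = {p_a_better:.3f}, 95% interval for F1_A = [{lo:.3f}, {hi:.3f}]")
```

Because every sample is a ratio of nonnegative proportions, the resulting interval cannot leave [0,1], which is the failure mode of t-based intervals that the abstract points out. Treating the two posteriors as independent ignores that both models are scored on the same corpus, so this sketch is only a rough stand-in for the test described above.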

Tasks

Chunking, Two-sample testing
