SOTAVerified

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

2020-05-01 · ACL 2020 · Code Available

Shikib Mehri, Maxine Eskenazi


Abstract

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research, and standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog without requiring reference responses. USR is shown to correlate strongly with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
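The turn-level numbers quoted above are Spearman correlations between a metric's scores and human quality judgments. As a minimal pure-Python sketch of that computation (the scores below are hypothetical, not the paper's data):

```python
def rankdata(xs):
    """Assign 1-based ranks to xs, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical metric scores vs. human ratings for four responses:
print(spearman([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 5]))  # → 1.0 (same ranking)
```

In practice the correlations reported in the table are typically computed with a library routine such as `scipy.stats.spearmanr`, which implements the same rank-then-correlate procedure with tie handling.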

Benchmark Results

Dataset           Model              Metric                Claimed  Verified  Status
USR-PersonaChat   USR - DR (x = f)   Spearman Correlation  -0.05    —         Unverified
USR-PersonaChat   USR - DR (x = c)   Spearman Correlation   0.48    —         Unverified
USR-PersonaChat   USR                Spearman Correlation   0.47    —         Unverified
USR-PersonaChat   USR - MLM          Spearman Correlation   0.08    —         Unverified
USR-TopicalChat   USR                Spearman Correlation   0.42    —         Unverified
USR-TopicalChat   USR - DR (x = c)   Spearman Correlation   0.32    —         Unverified
USR-TopicalChat   USR - MLM          Spearman Correlation   0.31    —         Unverified
USR-TopicalChat   USR - DR (x = f)   Spearman Correlation   0.14    —         Unverified
