SOTAVerified

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

2020-05-01 · ACL 2020 · Code Available

Shikib Mehri, Maxine Eskenazi


Abstract

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research, and standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog without requiring reference responses. USR is shown to correlate strongly with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
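The turn-level numbers quoted above are Spearman correlations between a metric's scores and human quality judgments. As a minimal pure-Python sketch of that computation (the scores below are hypothetical, not the paper's data):

```python
def rankdata(xs):
    """Assign 1-based ranks to xs, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical metric scores vs. human ratings for four responses:
print(spearman([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 5]))  # → 1.0 (same ranking)
```

In practice the correlations reported in the table are typically computed with a library routine such as `scipy.stats.spearmanr`, which implements the same rank-then-correlate procedure with tie handling.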

Benchmark Results

Dataset           Model              Metric                Claimed  Verified  Status
USR-PersonaChat   USR - DR (x = f)   Spearman Correlation  -0.05    —         Unverified
USR-PersonaChat   USR - DR (x = c)   Spearman Correlation   0.48    —         Unverified
USR-PersonaChat   USR                Spearman Correlation   0.47    —         Unverified
USR-PersonaChat   USR - MLM          Spearman Correlation   0.08    —         Unverified
USR-TopicalChat   USR                Spearman Correlation   0.42    —         Unverified
USR-TopicalChat   USR - DR (x = c)   Spearman Correlation   0.32    —         Unverified
USR-TopicalChat   USR - MLM          Spearman Correlation   0.31    —         Unverified
USR-TopicalChat   USR - DR (x = f)   Spearman Correlation   0.14    —         Unverified
