USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Shikib Mehri, Maxine Eskenazi
Code
- github.com/shikib/usr (official, in paper, PyTorch, ★ 50)
Abstract
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| USR-PersonaChat | USR - DR (x = f) | Spearman Correlation | -0.05 | — | Unverified |
| USR-PersonaChat | USR - DR (x = c) | Spearman Correlation | 0.48 | — | Unverified |
| USR-PersonaChat | USR | Spearman Correlation | 0.47 | — | Unverified |
| USR-PersonaChat | USR - MLM | Spearman Correlation | 0.08 | — | Unverified |
| USR-TopicalChat | USR | Spearman Correlation | 0.42 | — | Unverified |
| USR-TopicalChat | USR - DR (x = c) | Spearman Correlation | 0.32 | — | Unverified |
| USR-TopicalChat | USR - MLM | Spearman Correlation | 0.31 | — | Unverified |
| USR-TopicalChat | USR - DR (x = f) | Spearman Correlation | 0.14 | — | Unverified |
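The table reports Spearman correlations between metric scores and human judgments, and the abstract distinguishes turn-level from system-level correlation. The sketch below illustrates how those two quantities are typically computed; it is not the authors' evaluation code. The system names, metric scores, and human ratings are hypothetical toy values, and the helper functions (`average_ranks`, `pearson`, `spearman`, `system_means`) are written here for illustration only. It assumes turn-level correlation is taken over individual responses and system-level correlation over per-system means.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def system_means(rows):
    """Average metric scores and human ratings per system."""
    names = sorted({name for name, _, _ in rows})
    metric = [mean(m for n, m, _ in rows if n == name) for name in names]
    human = [mean(h for n, _, h in rows if n == name) for name in names]
    return metric, human


# Hypothetical toy data: (system, metric score, human rating) per response.
turns = [
    ("system_a", 0.20, 2), ("system_a", 0.35, 1), ("system_a", 0.30, 3),
    ("system_b", 0.55, 3), ("system_b", 0.40, 4), ("system_b", 0.60, 3),
    ("system_c", 0.80, 5), ("system_c", 0.70, 4), ("system_c", 0.65, 5),
]

# Turn-level: correlate scores over individual responses.
turn_level = spearman([m for _, m, _ in turns], [h for _, _, h in turns])

# System-level: correlate per-system averages.
system_level = spearman(*system_means(turns))

print(f"turn-level: {turn_level:.2f}, system-level: {system_level:.2f}")
```

In this toy data the metric orders the three systems exactly as the human ratings do, so the system-level correlation is 1.0 even though the turn-level correlation is lower. With only a handful of systems, a perfect system-level score is much easier to reach than a high turn-level one.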