SOTAVerified

Dialogue Evaluation

Papers

Showing 21–30 of 97 papers (page 3 of 10)

Title | Status | Hype
DialogBench: Evaluating LLMs as Human-like Dialogue Systems | Code | 1
xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark | Code | 0
RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | - | 0
Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation | Code | 0
Towards Multilingual Automatic Dialogue Evaluation | Code | 0
Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation | Code | 0
C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation | Code | 0
How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation | - | 0
DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation | Code | 1
U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MDD-Eval | Spearman Correlation | 0.51 | - | Unverified
2 | Lin-Reg (all) | Spearman Correlation | 0.49 | - | Unverified
3 | USR | Spearman Correlation | 0.42 | - | Unverified
4 | USR - DR (x = c) | Spearman Correlation | 0.32 | - | Unverified
5 | USR - MLM | Spearman Correlation | 0.31 | - | Unverified
6 | USR - DR (x = f) | Spearman Correlation | 0.14 | - | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Lin-Reg (all) | Spearman Correlation | 0.54 | - | Unverified
2 | USR - DR (x = c) | Spearman Correlation | 0.48 | - | Unverified
3 | USR | Spearman Correlation | 0.47 | - | Unverified
4 | USR - MLM | Spearman Correlation | 0.08 | - | Unverified
5 | USR - DR (x = f) | Spearman Correlation | -0.05 | - | Unverified
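
The Claimed column reports each metric's Spearman correlation with human quality judgments. A minimal sketch of how such a value could be recomputed during verification, assuming per-response metric scores and human ratings are available as parallel lists; the numbers below are illustrative only, not drawn from any listed paper:

from scipy.stats import spearmanr

# Hypothetical per-response scores; actual verification would load the
# metric outputs and human annotations released with each paper.
metric_scores = [0.91, 0.13, 0.55, 0.72, 0.30]  # automatic metric outputs
human_ratings = [4, 1, 3, 5, 2]                 # human quality judgments

# Spearman correlation compares the rankings induced by the two lists,
# so it is insensitive to the metric's absolute scale.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")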