SOTAVerified

Dialogue Evaluation

Papers

Showing 26–50 of 97 papers

| Title | Status | Hype |
| --- | --- | --- |
| Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | | 0 |
| Improving Open-Domain Dialogue Evaluation with a Causal Inference Model | | 0 |
| Investigating the Impact of Pre-trained Language Models on Dialog Evaluation | | 0 |
| Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations | | 0 |
| Learning the Human Judgment for the Automatic Evaluation of Chatbot | | 0 |
| LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | | 0 |
| Leveraging LLMs for Dialogue Quality Measurement | | 0 |
| LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation | | 0 |
| MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation | | 0 |
| DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels | | 0 |
| AdaCoach: A Virtual Coach for Training Customer Service Agents | | 0 |
| ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons | | 0 |
| MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue | | 0 |
| One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning | | 0 |
| On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | | 0 |
| PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment | | 0 |
| Pragmatically Appropriate Diversity for Dialogue Evaluation | | 0 |
| Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers | | 0 |
| Dialogue Evaluation with Offline Reinforcement Learning | | 0 |
| RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | | 0 |
| Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses | | 0 |
| Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges | | 0 |
| Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations | | 0 |
| DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation | | 0 |
| Enhancing the Open-Domain Dialogue Evaluation in Latent Space | | 0 |
Page 2 of 4

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MDD-Eval | Spearman Correlation | 0.51 | | Unverified |
| 2 | Lin-Reg (all) | Spearman Correlation | 0.49 | | Unverified |
| 3 | USR | Spearman Correlation | 0.42 | | Unverified |
| 4 | USR - DR (x = c) | Spearman Correlation | 0.32 | | Unverified |
| 5 | USR - MLM | Spearman Correlation | 0.31 | | Unverified |
| 6 | USR - DR (x = f) | Spearman Correlation | 0.14 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Lin-Reg (all) | Spearman Correlation | 0.54 | | Unverified |
| 2 | USR - DR (x = c) | Spearman Correlation | 0.48 | | Unverified |
| 3 | USR | Spearman Correlation | 0.47 | | Unverified |
| 4 | USR - MLM | Spearman Correlation | 0.08 | | Unverified |
| 5 | USR - DR (x = f) | Spearman Correlation | -0.05 | | Unverified |
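The tables above report Spearman correlation, i.e. the rank correlation between each metric's scores and human quality ratings. A minimal dependency-free sketch of how such a number is computed (the score values below are hypothetical, not taken from any benchmark):

```python
def rank(xs):
    """Assign average ranks (1-based), splitting ties evenly."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over a run of equal values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(a), rank(b))

# Hypothetical metric scores vs. human ratings for four dialogues:
metric_scores = [0.8, 0.6, 0.9, 0.4]
human_ratings = [4, 3, 5, 2]
print(spearman(metric_scores, human_ratings))  # → 1.0 (identical rankings)
```

Because Spearman works on ranks, it rewards a metric for ordering responses the same way humans do, regardless of the absolute score scale.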