SOTAVerified

Dialogue Evaluation

Papers

Showing 1-50 of 97 papers

Title | Status | Hype
Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models | Code | 2
Assessing Dialogue Systems with Distribution Distances | Code | 1
GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation | Code | 1
GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems | Code | 1
DialogBench: Evaluating LLMs as Human-like Dialogue Systems | Code | 1
DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation | Code | 1
RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems | Code | 1
Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation | Code | 1
Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining | Code | 1
Automatic Evaluation and Moderation of Open-domain Dialogue Systems | Code | 1
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning | Code | 1
Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances | Code | 1
Towards Quantifiable Dialogue Coherence Evaluation | Code | 1
Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian | Code | 1
Learning an Unreferenced Metric for Online Dialogue Evaluation | Code | 1
RuNNE-2022 Shared Task: Recognizing Nested Named Entities | Code | 1
Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering | Code | 1
Unsupervised Evaluation of Interactive Dialog with DialoGPT | Code | 1
Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems | Code | 1
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation | Code | 1
A Comprehensive Assessment of Dialog Evaluation Metrics | Code | 1
PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems | Code | 1
DynaEval: Unifying Turn and Dialogue Level Evaluation | Code | 1
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation | Code | 1
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents | – | 0
Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | – | 0
Improving Open-Domain Dialogue Evaluation with a Causal Inference Model | – | 0
Investigating the Impact of Pre-trained Language Models on Dialog Evaluation | – | 0
Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations | – | 0
Learning the Human Judgment for the Automatic Evaluation of Chatbot | – | 0
LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | – | 0
Leveraging LLMs for Dialogue Quality Measurement | – | 0
LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation | – | 0
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation | – | 0
DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels | – | 0
AdaCoach: A Virtual Coach for Training Customer Service Agents | – | 0
ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons | – | 0
MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue | – | 0
One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning | – | 0
On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | – | 0
PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment | – | 0
Pragmatically Appropriate Diversity for Dialogue Evaluation | – | 0
Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers | – | 0
Dialogue Evaluation with Offline Reinforcement Learning | – | 0
RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | – | 0
Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses | – | 0
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges | – | 0
Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations | – | 0
DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation | – | 0
Enhancing the Open-Domain Dialogue Evaluation in Latent Space | – | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MDD-Eval | Spearman Correlation | 0.51 | – | Unverified
2 | Lin-Reg (all) | Spearman Correlation | 0.49 | – | Unverified
3 | USR | Spearman Correlation | 0.42 | – | Unverified
4 | USR - DR (x = c) | Spearman Correlation | 0.32 | – | Unverified
5 | USR - MLM | Spearman Correlation | 0.31 | – | Unverified
6 | USR - DR (x = f) | Spearman Correlation | 0.14 | – | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Lin-Reg (all) | Spearman Correlation | 0.54 | – | Unverified
2 | USR - DR (x = c) | Spearman Correlation | 0.48 | – | Unverified
3 | USR | Spearman Correlation | 0.47 | – | Unverified
4 | USR - MLM | Spearman Correlation | 0.08 | – | Unverified
5 | USR - DR (x = f) | Spearman Correlation | -0.05 | – | Unverified
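The Claimed column in both tables is a Spearman correlation between a metric's automatic scores and human quality ratings on the benchmark's evaluation set. As an illustrative sketch only (the two score lists below are made-up placeholders, not values from any paper or benchmark above), such a correlation can be computed with scipy.stats.spearmanr:

    # Illustrative sketch: Spearman correlation between automatic metric scores
    # and human quality ratings. Both lists are hypothetical placeholder values.
    from scipy.stats import spearmanr

    metric_scores = [0.81, 0.42, 0.67, 0.15, 0.90, 0.55]  # one automatic score per response
    human_ratings = [4.5, 2.0, 3.5, 1.5, 5.0, 3.0]        # corresponding human ratings

    rho, p_value = spearmanr(metric_scores, human_ratings)
    print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")

Spearman correlation is used rather than Pearson because it compares rankings, so it tolerates metrics and human ratings being on different scales.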