SOTAVerified

Dialogue Evaluation

Papers

Showing 51–75 of 97 papers

| Title | Status | Hype |
| --- | --- | --- |
| SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation | Code | 0 |
| Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation | Code | 0 |
| SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation | Code | 0 |
| Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs | Code | 0 |
| Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation | Code | 0 |
| Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation | Code | 0 |
| Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses | Code | 0 |
| Towards Multilingual Automatic Dialogue Evaluation | Code | 0 |
| Transformers for Headline Selection for Russian News Clusters | Code | 0 |
| What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation | Code | 0 |
| Towards Best Experiment Design for Evaluating Dialogue System Output | Code | 0 |
| MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue | | 0 |
| One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning | | 0 |
| On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | | 0 |
| U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation | | 0 |
| PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment | | 0 |
| Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations | | 0 |
| Pragmatically Appropriate Diversity for Dialogue Evaluation | | 0 |
| Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers | | 0 |
| ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons | | 0 |
| User Response and Sentiment Prediction for Automatic Dialogue Evaluation | | 0 |
| Dialogue Evaluation with Offline Reinforcement Learning | | 0 |
| RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | | 0 |
| Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses | | 0 |
| Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges | | 0 |
Page 3 of 4

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MDD-Eval | Spearman Correlation | 0.51 | | Unverified |
| 2 | Lin-Reg (all) | Spearman Correlation | 0.49 | | Unverified |
| 3 | USR | Spearman Correlation | 0.42 | | Unverified |
| 4 | USR - DR (x = c) | Spearman Correlation | 0.32 | | Unverified |
| 5 | USR - MLM | Spearman Correlation | 0.31 | | Unverified |
| 6 | USR - DR (x = f) | Spearman Correlation | 0.14 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Lin-Reg (all) | Spearman Correlation | 0.54 | | Unverified |
| 2 | USR - DR (x = c) | Spearman Correlation | 0.48 | | Unverified |
| 3 | USR | Spearman Correlation | 0.47 | | Unverified |
| 4 | USR - MLM | Spearman Correlation | 0.08 | | Unverified |
| 5 | USR - DR (x = f) | Spearman Correlation | -0.05 | | Unverified |
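Every score in the tables above is a Spearman correlation: the rank correlation between an automatic metric's scores and human quality judgments over the same set of dialogue responses. A minimal pure-Python sketch of that computation is below; the sample ratings are illustrative values, not taken from any listed paper or benchmark.

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: human ratings vs. an automatic metric's scores.
human = [4.0, 2.5, 3.0, 5.0, 1.5]
metric = [0.81, 0.42, 0.55, 0.60, 0.30]
print(f"Spearman rho = {spearman(human, metric):.2f}")  # high but imperfect rank agreement
```

A correlation near 1 means the metric ranks responses almost exactly as humans do; values near 0 (like USR - MLM's 0.08 in the second table) mean the metric's ranking is essentially unrelated to human judgment.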