SOTAVerified

Dialogue Evaluation

Papers

Showing 1–50 of 97 papers

| Title | Status | Hype |
|---|---|---|
| Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models | Code | 2 |
| DialogBench: Evaluating LLMs as Human-like Dialogue Systems | Code | 1 |
| DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation | Code | 1 |
| GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation | Code | 1 |
| Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems | Code | 1 |
| FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation | Code | 1 |
| Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian | Code | 1 |
| InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning | Code | 1 |
| RuNNE-2022 Shared Task: Recognizing Nested Named Entities | Code | 1 |
| Automatic Evaluation and Moderation of Open-domain Dialogue Systems | Code | 1 |
| A Comprehensive Assessment of Dialog Evaluation Metrics | Code | 1 |
| Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances | Code | 1 |
| DynaEval: Unifying Turn and Dialogue Level Evaluation | Code | 1 |
| Towards Quantifiable Dialogue Coherence Evaluation | Code | 1 |
| Assessing Dialogue Systems with Distribution Distances | Code | 1 |
| Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering | Code | 1 |
| GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems | Code | 1 |
| Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining | Code | 1 |
| Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation | Code | 1 |
| Unsupervised Evaluation of Interactive Dialog with DialoGPT | Code | 1 |
| USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation | Code | 1 |
| Learning an Unreferenced Metric for Online Dialogue Evaluation | Code | 1 |
| PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems | Code | 1 |
| RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems | Code | 1 |
| DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation | | 0 |
| MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | Code | 0 |
| MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation | | 0 |
| LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | | 0 |
| Methods for Recognizing Nested Terms | Code | 0 |
| RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts | Code | 0 |
| BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation | Code | 0 |
| Measuring the Robustness of Reference-Free Dialogue Evaluation Systems | Code | 0 |
| Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations | | 0 |
| Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs | Code | 0 |
| ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues | Code | 0 |
| On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | | 0 |
| Leveraging LLMs for Dialogue Quality Measurement | | 0 |
| LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation | | 0 |
| SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation | Code | 0 |
| PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison | Code | 0 |
| Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation | Code | 0 |
| CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge | | 0 |
| A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators | Code | 0 |
| xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark | Code | 0 |
| RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | | 0 |
| Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation | Code | 0 |
| Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation | Code | 0 |
| Towards Multilingual Automatic Dialogue Evaluation | Code | 0 |
| C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation | Code | 0 |
| How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | MDD-Eval | Spearman Correlation | 0.51 | | Unverified |
| 2 | Lin-Reg (all) | Spearman Correlation | 0.49 | | Unverified |
| 3 | USR | Spearman Correlation | 0.42 | | Unverified |
| 4 | USR - DR (x = c) | Spearman Correlation | 0.32 | | Unverified |
| 5 | USR - MLM | Spearman Correlation | 0.31 | | Unverified |
| 6 | USR - DR (x = f) | Spearman Correlation | 0.14 | | Unverified |
| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | Lin-Reg (all) | Spearman Correlation | 0.54 | | Unverified |
| 2 | USR - DR (x = c) | Spearman Correlation | 0.48 | | Unverified |
| 3 | USR | Spearman Correlation | 0.47 | | Unverified |
| 4 | USR - MLM | Spearman Correlation | 0.08 | | Unverified |
| 5 | USR - DR (x = f) | Spearman Correlation | -0.05 | | Unverified |
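The Spearman correlations above measure rank agreement between a metric's automatic scores and human quality ratings over the same set of dialogue responses. A minimal sketch of how such a correlation is computed, assuming hypothetical placeholder arrays (`metric_scores` and `human_ratings` are illustrative, not data from any paper listed above):

```python
# Minimal sketch: reproducing a claimed Spearman correlation for a
# dialogue evaluation metric. The arrays below are hypothetical
# placeholders, not data from any paper or benchmark listed above.
from scipy.stats import spearmanr

# One automatic score per dialogue response from the metric under test.
metric_scores = [0.81, 0.42, 0.67, 0.10, 0.93, 0.55]

# One human quality rating per response (e.g. mean annotator score on a 1-5 scale).
human_ratings = [4.5, 2.0, 3.5, 1.5, 5.0, 3.0]

# Spearman's rho: rank correlation between the two score lists,
# insensitive to any monotonic rescaling of either one.
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```

Under this reading, a row would move from Unverified to Verified once the metric is re-run on the benchmark's responses and the reproduced correlation matches the claimed value.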