SOTAVerified

Dialogue Evaluation

Papers

Showing 51–97 of 97 papers

| Title | Status | Hype |
| --- | --- | --- |
| CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge | | 0 |
| Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis | | 0 |
| Treating Dialogue Quality Evaluation as an Anomaly Detection Problem | | 0 |
| U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation | | 0 |
| User Response and Sentiment Prediction for Automatic Dialogue Evaluation | | 0 |
| WeChat AI & ICT's Submission for DSTC9 Interactive Dialogue Evaluation Track | | 0 |
| FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows | | 0 |
| Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings | | 0 |
| How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation | | 0 |
| How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning | | 0 |
| xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark | Code | 0 |
| Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | Code | 0 |
| A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators | Code | 0 |
| Adversarial Learning for Neural Dialogue Generation | Code | 0 |
| A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues | Code | 0 |
| An Adversarially-Learned Turing Test for Dialog Generation Models | Code | 0 |
| Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems | Code | 0 |
| BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation | Code | 0 |
| C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation | Code | 0 |
| DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations | Code | 0 |
| Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems | Code | 0 |
| ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues | Code | 0 |
| Evaluating Coherence in Dialogue Systems using Entailment | Code | 0 |
| Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation | Code | 0 |
| GCDF1: A Goal- and Context- Driven F-Score for Evaluating User Models | Code | 0 |
| Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model | Code | 0 |
| Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation | Code | 0 |
| Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References | Code | 0 |
| MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation | Code | 0 |
| Measuring the Robustness of Reference-Free Dialogue Evaluation Systems | Code | 0 |
| MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | Code | 0 |
| Methods for Recognizing Nested Terms | Code | 0 |
| PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison | Code | 0 |
| Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems | Code | 0 |
| Proxy Indicators for the Quality of Open-domain Dialogues | Code | 0 |
| RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts | Code | 0 |
| SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation | Code | 0 |
| Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation | Code | 0 |
| SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation | Code | 0 |
| Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs | Code | 0 |
| Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation | Code | 0 |
| Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation | Code | 0 |
| Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses | Code | 0 |
| Towards Multilingual Automatic Dialogue Evaluation | Code | 0 |
| Transformers for Headline Selection for Russian News Clusters | Code | 0 |
| What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation | Code | 0 |
| Towards Best Experiment Design for Evaluating Dialogue System Output | Code | 0 |

Benchmark Results

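| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | MDD-Eval | Spearman Correlation | 0.51 | | Unverified |
| 2 | Lin-Reg (all) | Spearman Correlation | 0.49 | | Unverified |
| 3 | USR | Spearman Correlation | 0.42 | | Unverified |
| 4 | USR - DR (x = c) | Spearman Correlation | 0.32 | | Unverified |
| 5 | USR - MLM | Spearman Correlation | 0.31 | | Unverified |
| 6 | USR - DR (x = f) | Spearman Correlation | 0.14 | | Unverified |

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | Lin-Reg (all) | Spearman Correlation | 0.54 | | Unverified |
| 2 | USR - DR (x = c) | Spearman Correlation | 0.48 | | Unverified |
| 3 | USR | Spearman Correlation | 0.47 | | Unverified |
| 4 | USR - MLM | Spearman Correlation | 0.08 | | Unverified |
| 5 | USR - DR (x = f) | Spearman Correlation | -0.05 | | Unverified |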