SOTAVerified

Dialogue Evaluation

Papers

Showing 51–97 of 97 papers

Title | Status | Hype
U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation |  | 0
Pragmatically Appropriate Diversity for Dialogue Evaluation |  | 0
Improving Open-Domain Dialogue Evaluation with a Causal Inference Model |  | 0
PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment |  | 0
Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations |  | 0
Dialogue Evaluation with Offline Reinforcement Learning |  | 0
SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation | Code | 0
Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis |  | 0
MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue |  | 0
AdaCoach: A Virtual Coach for Training Customer Service Agents |  | 0
What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation | Code | 0
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges |  | 0
DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations | Code | 0
Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | Code | 0
FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows |  | 0
Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents |  | 0
MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation | Code | 0
User Response and Sentiment Prediction for Automatic Dialogue Evaluation |  | 0
GCDF1: A Goal- and Context- Driven F-Score for Evaluating User Models | Code | 0
Proxy Indicators for the Quality of Open-domain Dialogues | Code | 0
Investigating the Impact of Pre-trained Language Models on Dialog Evaluation |  | 0
Achieving Reliable Human Assessment of Open-Domain Dialogue Systems |  | 0
A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues | Code | 0
Enhancing the Open-Domain Dialogue Evaluation in Latent Space |  | 0
Transformers for Headline Selection for Russian News Clusters | Code | 0
Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation | Code | 0
Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation | Code | 0
Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model | Code | 0
DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels |  | 0
An Adversarially-Learned Turing Test for Dialog Generation Models | Code | 0
WeChat AI & ICT's Submission for DSTC9 Interactive Dialogue Evaluation Track |  | 0
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems | Code | 0
Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers |  | 0
Learning the Human Judgment for the Automatic Evaluation of Chatbot |  | 0
Treating Dialogue Quality Evaluation as an Anomaly Detection Problem |  | 0
How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning |  | 0
Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems | Code | 0
Towards Best Experiment Design for Evaluating Dialogue System Output | Code | 0
ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons |  | 0
Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References | Code | 0
Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems | Code | 0
Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings |  | 0
Evaluating Coherence in Dialogue Systems using Entailment | Code | 0
Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses |  | 0
One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning |  | 0
Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses | Code | 0
Adversarial Learning for Neural Dialogue Generation | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | MDD-Eval | Spearman Correlation | 0.51 |  | Unverified
2 | Lin-Reg (all) | Spearman Correlation | 0.49 |  | Unverified
3 | USR | Spearman Correlation | 0.42 |  | Unverified
4 | USR - DR (x = c) | Spearman Correlation | 0.32 |  | Unverified
5 | USR - MLM | Spearman Correlation | 0.31 |  | Unverified
6 | USR - DR (x = f) | Spearman Correlation | 0.14 |  | Unverified

# | Model | Metric | Claimed | Verified | Status
1 | Lin-Reg (all) | Spearman Correlation | 0.54 |  | Unverified
2 | USR - DR (x = c) | Spearman Correlation | 0.48 |  | Unverified
3 | USR | Spearman Correlation | 0.47 |  | Unverified
4 | USR - MLM | Spearman Correlation | 0.08 |  | Unverified
5 | USR - DR (x = f) | Spearman Correlation | -0.05 |  | Unverified