| Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | Sep 17, 2021 | Dialogue Evaluation | —Unverified | 0 |
| Improving Open-Domain Dialogue Evaluation with a Causal Inference Model | Jan 31, 2023 | Causal Inference, counterfactual | —Unverified | 0 |
| Investigating the Impact of Pre-trained Language Models on Dialog Evaluation | Oct 5, 2021 | Dialogue Evaluation, Open-Domain Dialog | —Unverified | 0 |
| Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations | Oct 1, 2022 | Dialogue Evaluation, Multi-Task Learning | —Unverified | 0 |
| Learning the Human Judgment for the Automatic Evaluation of Chatbot | May 1, 2020 | Chatbot, Dialogue Evaluation | —Unverified | 0 |
| LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | May 26, 2025 | Dialogue Evaluation | —Unverified | 0 |
| Leveraging LLMs for Dialogue Quality Measurement | Jun 25, 2024 | Dialogue Evaluation | —Unverified | 0 |
| LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation | Jun 5, 2024 | Dialogue Evaluation, Sensitivity | —Unverified | 0 |
| MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation | May 27, 2025 | Dialogue Evaluation | —Unverified | 0 |
| DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels | Apr 18, 2021 | Dialogue Evaluation, Machine Translation | —Unverified | 0 |
| AdaCoach: A Virtual Coach for Training Customer Service Agents | Apr 27, 2022 | Dialogue Evaluation | —Unverified | 0 |
| ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons | Sep 6, 2019 | Dialogue Evaluation | —Unverified | 0 |
| MME-CRS: Multi-Metric Evaluation Based on Correlation Re-Scaling for Evaluating Open-Domain Dialogue | Jun 19, 2022 | Dialogue Evaluation, MME | —Unverified | 0 |
| One "Ruler" for All Languages: Multi-Lingual Dialogue Evaluation with Adversarial Multi-Task Learning | May 8, 2018 | All, Dialogue Evaluation | —Unverified | 0 |
| On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | Jul 4, 2024 | Benchmarking, Chatbot | —Unverified | 0 |
| PoE: a Panel of Experts for Generalized Automatic Dialogue Assessment | Dec 18, 2022 | Data Augmentation, Dialogue Evaluation | —Unverified | 0 |
| Pragmatically Appropriate Diversity for Dialogue Evaluation | Apr 6, 2023 | Dialogue Evaluation, Diversity | —Unverified | 0 |
| Predicting Ratings of Real Dialogue Participants from Artificial Data and Ratings of Human Dialogue Observers | May 1, 2020 | Dialogue Evaluation | —Unverified | 0 |
| Dialogue Evaluation with Offline Reinforcement Learning | Sep 2, 2022 | Dialogue Evaluation, Offline RL | —Unverified | 0 |
| RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | Sep 15, 2023 | Dialogue Evaluation, Multi-Task Learning | —Unverified | 0 |
| Re-evaluating ADEM: A Deeper Look at Scoring Dialogue Responses | Feb 23, 2019 | Dialogue Evaluation, Response Generation | —Unverified | 0 |
| Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges | Mar 18, 2022 | Dialogue Evaluation | —Unverified | 0 |
| Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations | Sep 3, 2024 | Dialogue Evaluation | —Unverified | 0 |
| DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation | Jun 4, 2025 | Dialogue Evaluation, valid | —Unverified | 0 |
| Enhancing the Open-Domain Dialogue Evaluation in Latent Space | Aug 1, 2021 | Dialogue Evaluation | —Unverified | 0 |