| DCH-2: A Parallel Customer-Helpdesk Dialogue Corpus with Distributions of Annotators' Labels | Apr 18, 2021 | Dialogue Evaluation, Machine Translation | —Unverified | 0 | 0 |
| FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows | Feb 14, 2022 | Dialogue Evaluation | —Unverified | 0 | 0 |
| How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation | May 23, 2023 | Chatbot, Dialogue Evaluation | —Unverified | 0 | 0 |
| How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning | Dec 10, 2019 | Continual Learning, Dialogue Evaluation | —Unverified | 0 | 0 |
| Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents | Jan 12, 2022 | Dialogue Evaluation, Sensitivity | —Unverified | 0 | 0 |
| Better Automatic Evaluation of Open-Domain Dialogue Systems with Contextualized Embeddings | Apr 24, 2019 | Dialogue Evaluation, valid | —Unverified | 0 | 0 |
| Explaining Dialogue Evaluation Metrics using Adversarial Behavioral Analysis | Jul 1, 2022 | Dialogue Evaluation | —Unverified | 0 | 0 |
| Improving Open-Domain Dialogue Evaluation with a Causal Inference Model | Jan 31, 2023 | Causal Inference, counterfactual | —Unverified | 0 | 0 |
| Enhancing the Open-Domain Dialogue Evaluation in Latent Space | Aug 1, 2021 | Dialogue Evaluation | —Unverified | 0 | 0 |
| CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge | Mar 13, 2024 | Dialogue Evaluation, HumanEval | —Unverified | 0 | 0 |
| Investigating the Impact of Pre-trained Language Models on Dialog Evaluation | Oct 5, 2021 | Dialogue Evaluation, Open-Domain Dialog | —Unverified | 0 | 0 |
| Joint Goal Segmentation and Goal Success Prediction on Multi-Domain Conversations | Oct 1, 2022 | Dialogue Evaluation, Multi-Task Learning | —Unverified | 0 | 0 |
| DRE: An Effective Dual-Refined Method for Integrating Small and Large Language Models in Open-Domain Dialogue Evaluation | Jun 4, 2025 | Dialogue Evaluation, valid | —Unverified | 0 | 0 |
| Learning the Human Judgment for the Automatic Evaluation of Chatbot | May 1, 2020 | Chatbot, Dialogue Evaluation | —Unverified | 0 | 0 |
| LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | May 26, 2025 | Dialogue Evaluation | —Unverified | 0 | 0 |
| Leveraging LLMs for Dialogue Quality Measurement | Jun 25, 2024 | Dialogue Evaluation | —Unverified | 0 | 0 |
| LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation | Jun 5, 2024 | Dialogue Evaluation, Sensitivity | —Unverified | 0 | 0 |
| MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation | May 27, 2025 | Dialogue Evaluation | —Unverified | 0 | 0 |
| Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | Sep 17, 2021 | Dialogue Evaluation | —Unverified | 0 | 0 |
| AdaCoach: A Virtual Coach for Training Customer Service Agents | Apr 27, 2022 | Dialogue Evaluation | —Unverified | 0 | 0 |
| WeChat AI & ICT's Submission for DSTC9 Interactive Dialogue Evaluation Track | Jan 20, 2021 | Dialogue Evaluation, Language Modeling | —Unverified | 0 | 0 |
| Treating Dialogue Quality Evaluation as an Anomaly Detection Problem | May 1, 2020 | Anomaly Detection, Dialogue Evaluation | —Unverified | 0 | 0 |