| DialogBench: Evaluating LLMs as Human-like Dialogue Systems | Nov 3, 2023 | Dialogue Evaluation | Code Available | 1 |
| xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark | Oct 13, 2023 | Dialogue Evaluation, Machine Translation | Code Available | 0 |
| RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | Sep 15, 2023 | Dialogue Evaluation, Multi-Task Learning | Unverified | 0 |
| Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation | Sep 14, 2023 | Chatbot, Dialogue Evaluation | Code Available | 0 |
| Towards Multilingual Automatic Dialogue Evaluation | Aug 31, 2023 | Dialogue Evaluation, Machine Translation | Code Available | 0 |
| Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation | Aug 31, 2023 | Dialogue Evaluation | Code Available | 0 |
| C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation | Jun 27, 2023 | Dialogue Evaluation | Code Available | 0 |
| How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation | May 23, 2023 | Chatbot, Dialogue Evaluation | Unverified | 0 |
| DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation | May 8, 2023 | Contrastive Learning, Density Estimation | Code Available | 1 |
| U-NEED: A Fine-grained Dataset for User Needs-Centric E-commerce Conversational Recommendation | May 5, 2023 | Conversational Recommendation, Dialogue Evaluation | Unverified | 0 |