| Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models | Apr 7, 2025 | Dialogue Evaluation, Fairness | Code Available | 2 | 5 |
| PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems | Apr 6, 2020 | Dialogue Evaluation | Code Available | 1 | 5 |
| Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems | Dec 18, 2022 | Chatbot, Dialogue Evaluation | Code Available | 1 | 5 |
| Learning an Unreferenced Metric for Online Dialogue Evaluation | May 1, 2020 | Dialogue Evaluation | Code Available | 1 | 5 |
| DynaEval: Unifying Turn and Dialogue Level Evaluation | Jun 2, 2021 | Dialogue Evaluation | Code Available | 1 | 5 |
| Assessing Dialogue Systems with Distribution Distances | May 6, 2021 | Dialogue Evaluation | Code Available | 1 | 5 |
| InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning | May 25, 2022 | Dialogue Evaluation, Dialogue Generation | Code Available | 1 | 5 |
| Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation | Jul 1, 2020 | Dialogue Evaluation, Dialogue Generation | Code Available | 1 | 5 |
| Improving Dialog Evaluation with a Multi-reference Adversarial Dataset and Large Scale Pretraining | Sep 23, 2020 | Dialogue Evaluation | Code Available | 1 | 5 |
| Automatic Evaluation and Moderation of Open-domain Dialogue Systems | Nov 3, 2021 | Chatbot, Dialogue Evaluation | Code Available | 1 | 5 |
| Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian | Jun 3, 2022 | Binary Classification, Dialogue Evaluation | Code Available | 1 | 5 |
| FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation | Oct 25, 2022 | Dialogue Evaluation | Code Available | 1 | 5 |
| GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems | Oct 8, 2020 | Dialogue Evaluation | Code Available | 1 | 5 |
| RuNNE-2022 Shared Task: Recognizing Nested Named Entities | May 23, 2022 | Dialogue Evaluation, named-entity-recognition | Code Available | 1 | 5 |
| GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation | Feb 28, 2023 | Dialogue Evaluation, Dialogue Generation | Code Available | 1 | 5 |
| USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation | May 1, 2020 | Dialogue Evaluation, Open-Domain Dialog | Code Available | 1 | 5 |
| Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances | Jun 4, 2021 | Chatbot, Dialogue Evaluation | Code Available | 1 | 5 |
| A Comprehensive Assessment of Dialog Evaluation Metrics | Jun 7, 2021 | Dialogue Evaluation, Response Generation | Code Available | 1 | 5 |
| RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems | Jan 11, 2017 | Dialogue Evaluation, Open-Domain Dialog | Code Available | 1 | 5 |
| Unsupervised Evaluation of Interactive Dialog with DialoGPT | Jun 23, 2020 | Dialogue Evaluation, Open-Domain Dialog | Code Available | 1 | 5 |
| Towards Quantifiable Dialogue Coherence Evaluation | Jun 1, 2021 | Coherence Evaluation, Dialogue Evaluation | Code Available | 1 | 5 |
| DEnsity: Open-domain Dialogue Evaluation Metric using Density Estimation | May 8, 2023 | Contrastive Learning, Density Estimation | Code Available | 1 | 5 |
| Q^2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering | Apr 16, 2021 | Abstractive Text Summarization, Dialogue Evaluation | Code Available | 1 | 5 |
| DialogBench: Evaluating LLMs as Human-like Dialogue Systems | Nov 3, 2023 | Dialogue Evaluation | Code Available | 1 | 5 |
| xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark | Oct 13, 2023 | Dialogue Evaluation, Machine Translation | Code Available | 0 | 5 |
| Achieving Reliable Human Assessment of Open-Domain Dialogue Systems | Mar 11, 2022 | Dialogue Evaluation | Code Available | 0 | 5 |
| A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators | Dec 24, 2023 | Dialogue Evaluation | Code Available | 0 | 5 |
| Adversarial Learning for Neural Dialogue Generation | Jan 23, 2017 | Dialogue Evaluation, Dialogue Generation | Code Available | 0 | 5 |
| A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues | Aug 1, 2021 | Dialogue Evaluation | Code Available | 0 | 5 |
| An Adversarially-Learned Turing Test for Dialog Generation Models | Apr 16, 2021 | Dialogue Evaluation | Code Available | 0 | 5 |
| Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems | Jun 21, 2019 | Dialogue Evaluation, Knowledge Distillation | Code Available | 0 | 5 |
| BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation | Jan 17, 2025 | Decoder, Dialogue Evaluation | Code Available | 0 | 5 |
| C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation | Jun 27, 2023 | Dialogue Evaluation | Code Available | 0 | 5 |
| DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations | Mar 18, 2022 | Abstract Meaning Representation, Coherence Evaluation | Code Available | 0 | 5 |
| Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems | Nov 1, 2020 | Dialogue Evaluation, Semantic Similarity | Code Available | 0 | 5 |
| ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues | Jul 16, 2024 | Coherence Evaluation, Dialogue Evaluation | Code Available | 0 | 5 |
| Evaluating Coherence in Dialogue Systems using Entailment | Apr 6, 2019 | Dialogue Evaluation, Diversity | Code Available | 0 | 5 |
| Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation | Sep 14, 2023 | Chatbot, Dialogue Evaluation | Code Available | 0 | 5 |
| GCDF1: A Goal- and Context- Driven F-Score for Evaluating User Models | Nov 1, 2021 | Dialogue Evaluation, Task-Oriented Dialogue Systems | Code Available | 0 | 5 |
| Generating Negative Samples by Manipulating Golden Responses for Unsupervised Learning of a Response Evaluation Model | Jun 1, 2021 | Dialogue Evaluation | Code Available | 0 | 5 |
| Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation | Jun 5, 2021 | Dialogue Evaluation, Open-Domain Dialog | Code Available | 0 | 5 |
| Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References | Jul 24, 2019 | Dialogue Evaluation, Diversity | Code Available | 0 | 5 |
| MDD-Eval: Self-Training on Augmented Data for Multi-Domain Dialogue Evaluation | Dec 14, 2021 | Dialogue Evaluation | Code Available | 0 | 5 |
| Measuring the Robustness of Reference-Free Dialogue Evaluation Systems | Jan 12, 2025 | Dialogue Evaluation, TAG | Code Available | 0 | 5 |
| MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | May 28, 2025 | Benchmarking, Chatbot | Code Available | 0 | 5 |
| Methods for Recognizing Nested Terms | Apr 22, 2025 | Dialogue Evaluation, named-entity-recognition | Code Available | 0 | 5 |
| PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison | Apr 1, 2024 | Dialogue Evaluation | Code Available | 0 | 5 |
| Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems | Nov 4, 2019 | Dialogue Evaluation | Code Available | 0 | 5 |
| Proxy Indicators for the Quality of Open-domain Dialogues | Nov 1, 2021 | Dialogue Evaluation | Code Available | 0 | 5 |
| RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts | Apr 9, 2025 | Dialogue Evaluation, Language Modeling | Code Available | 0 | 5 |