| MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators | May 28, 2025 | Benchmarking, Chatbot | Code Available | 0 |
| MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation | May 27, 2025 | Dialogue Evaluation | Unverified | 0 |
| LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation | May 26, 2025 | Dialogue Evaluation | Unverified | 0 |
| Methods for Recognizing Nested Terms | Apr 22, 2025 | Dialogue Evaluation, named-entity-recognition | Code Available | 0 |
| RuOpinionNE-2024: Extraction of Opinion Tuples from Russian News Texts | Apr 9, 2025 | Dialogue Evaluation, Language Modeling | Code Available | 0 |
| BoK: Introducing Bag-of-Keywords Loss for Interpretable Dialogue Response Generation | Jan 17, 2025 | Decoder, Dialogue Evaluation | Code Available | 0 |
| Measuring the Robustness of Reference-Free Dialogue Evaluation Systems | Jan 12, 2025 | Dialogue Evaluation, TAG | Code Available | 0 |
| Dialogue You Can Trust: Human and AI Perspectives on Generated Conversations | Sep 3, 2024 | Dialogue Evaluation | Unverified | 0 |
| Soda-Eval: Open-Domain Dialogue Evaluation in the age of LLMs | Aug 20, 2024 | Dialogue Evaluation | Code Available | 0 |
| ECoh: Turn-level Coherence Evaluation for Multilingual Dialogues | Jul 16, 2024 | Coherence Evaluation, Dialogue Evaluation | Code Available | 0 |
| On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation | Jul 4, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Leveraging LLMs for Dialogue Quality Measurement | Jun 25, 2024 | Dialogue Evaluation | Unverified | 0 |
| LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation | Jun 5, 2024 | Dialogue Evaluation, Sensitivity | Unverified | 0 |
| SLIDE: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation | May 24, 2024 | Contrastive Learning, Dialogue Evaluation | Code Available | 0 |
| PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison | Apr 1, 2024 | Dialogue Evaluation | Code Available | 0 |
| Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation | Apr 1, 2024 | Abstract Meaning Representation, Dialogue Evaluation | Code Available | 0 |
| CodingTeachLLM: Empowering LLM's Coding Ability via AST Prior Knowledge | Mar 13, 2024 | Dialogue Evaluation, HumanEval | Unverified | 0 |
| A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators | Dec 24, 2023 | Dialogue Evaluation | Code Available | 0 |
| xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark | Oct 13, 2023 | Dialogue Evaluation, Machine Translation | Code Available | 0 |
| RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue | Sep 15, 2023 | Dialogue Evaluation, Multi-Task Learning | Unverified | 0 |
| Exploring the Impact of Human Evaluator Group on Chat-Oriented Dialogue Evaluation | Sep 14, 2023 | Chatbot, Dialogue Evaluation | Code Available | 0 |
| Simple LLM Prompting is State-of-the-Art for Robust and Multilingual Dialogue Evaluation | Aug 31, 2023 | Dialogue Evaluation | Code Available | 0 |
| Towards Multilingual Automatic Dialogue Evaluation | Aug 31, 2023 | Dialogue Evaluation, Machine Translation | Code Available | 0 |
| C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation | Jun 27, 2023 | Dialogue Evaluation | Code Available | 0 |
| How to Choose How to Choose Your Chatbot: A Massively Multi-System MultiReference Data Set for Dialog Metric Evaluation | May 23, 2023 | Chatbot, Dialogue Evaluation | Unverified | 0 |