| NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist | May 15, 2023 | Controllable Language Modelling, Dialogue Generation | Code Available | 3 |
| Towards a Unified Multi-Dimensional Evaluator for Text Generation | Oct 13, 2022 | nlg evaluation, Question Answering | Code Available | 2 |
| Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis | Oct 10, 2022 | All, Image Captioning | Code Available | 1 |
| Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation | Sep 14, 2021 | nlg evaluation, Style Transfer | Code Available | 1 |
| Leveraging Large Language Models for NLG Evaluation: Advances and Challenges | Jan 13, 2024 | nlg evaluation, Specificity | Code Available | 1 |
| Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory | May 24, 2023 | nlg evaluation, Text Generation | Code Available | 1 |
| G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment | Mar 29, 2023 | Dialogue Generation, Diversity | Code Available | 1 |
| LUNA: A Framework for Language Understanding and Naturalness Assessment | Jan 9, 2024 | nlg evaluation, Text Generation | Code Available | 1 |
| Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability | Jun 26, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons | Mar 11, 2022 | nlg evaluation | Code Available | 1 |
| Is ChatGPT a Good NLG Evaluator? A Preliminary Study | Mar 7, 2023 | nlg evaluation, Story Generation | Code Available | 1 |
| CoAScore: Chain-of-Aspects Prompting for NLG Evaluation | Dec 16, 2023 | nlg evaluation, Response Generation | Unverified | 0 |
| Treat the system like a human student: Automatic naturalness evaluation of generated text without reference texts | Nov 1, 2018 | Image Captioning, Machine Translation | Unverified | 0 |
| WaterJudge: Quality-Detection Trade-off when Watermarking Large Language Models | Mar 28, 2024 | nlg evaluation | Unverified | 0 |
| A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation -- through the Lens of Semantic Similarity Rating | May 24, 2022 | nlg evaluation, Semantic Similarity | Unverified | 0 |
| X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects | Nov 15, 2023 | Dialogue Generation, Language Modelling | Unverified | 0 |
| Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons | Jun 16, 2021 | nlg evaluation | Unverified | 0 |
| A Snapshot of NLG Evaluation Practices 2005 - 2014 | Sep 1, 2015 | nlg evaluation, Text Generation | Unverified | 0 |
| Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications | May 13, 2022 | nlg evaluation, Text Generation | Unverified | 0 |
| DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? | Apr 10, 2025 | Machine Translation, nlg evaluation | Unverified | 0 |
| A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation – through the Lens of Semantic Similarity Rating | Jul 1, 2022 | nlg evaluation, Semantic Similarity | Unverified | 0 |
| A Survey of Evaluation Metrics Used for NLG Systems | Aug 27, 2020 | Image Captioning, nlg evaluation | Unverified | 0 |
| DHP Benchmark: Are LLMs Good NLG Evaluators? | Aug 25, 2024 | Benchmarking, nlg evaluation | Unverified | 0 |
| Dialect-robust Evaluation of Generated Text | Nov 2, 2022 | nlg evaluation | Unverified | 0 |
| Dolphin: A Challenging and Diverse Benchmark for Arabic NLG | May 24, 2023 | Dialogue Generation, Diversity | Unverified | 0 |
| NLG-Metricverse: An End-to-End Library for Evaluating Natural Language Generation | Oct 1, 2022 | Management, nlg evaluation | Unverified | 0 |
| Evaluation of Text Generation: A Survey | Jun 26, 2020 | nlg evaluation, Survey | Unverified | 0 |
| Evaluation rules! On the use of grammars and rule-based systems for NLG evaluation | Dec 1, 2020 | nlg evaluation, Position | Unverified | 0 |
| Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators | Mar 6, 2025 | nlg evaluation | Unverified | 0 |
| The Authenticity Gap in Human Evaluation | May 24, 2022 | nlg evaluation, Single Particle Analysis | Unverified | 0 |
| ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation | Jun 10, 2021 | nlg evaluation, Text Generation | Unverified | 0 |
| ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation | Dec 17, 2021 | nlg evaluation, Text Generation | Unverified | 0 |
| Language Model Augmented Relevance Score | Aug 19, 2021 | Language Modeling, Language Modelling | Unverified | 0 |
| Large Language Models Are Active Critics in NLG Evaluation | Oct 14, 2024 | nlg evaluation, Prompt Engineering | Unverified | 0 |
| LLM-based NLG Evaluation: Current Status and Challenges | Feb 2, 2024 | nlg evaluation, Text Generation | Unverified | 0 |
| A Survey of Natural Language Generation | Dec 22, 2021 | Data-to-Text Generation, Deep Learning | Unverified | 0 |
| MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation | Jul 24, 2021 | Diversity, nlg evaluation | Unverified | 0 |
| A Tutorial on Evaluation Metrics used in Natural Language Generation | Jun 1, 2021 | nlg evaluation, Text Generation | Unverified | 0 |
| Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking | Nov 8, 2024 | Fact Checking, nlg evaluation | Unverified | 0 |
| Agreement is overrated: A plea for correlation to assess human evaluation reliability | Oct 1, 2019 | nlg evaluation | Unverified | 0 |
| Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts | Apr 29, 2025 | All, Diversity | Unverified | 0 |
| All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text | Jun 30, 2021 | All, Articles | Unverified | 0 |
| All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text | Aug 1, 2021 | All, Articles | Unverified | 0 |
| Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text | Feb 14, 2022 | nlg evaluation, Text Generation | Unverified | 0 |
| Rethinking Model Evaluation as Narrowing the Socio-Technical Gap | Jun 1, 2023 | Explainable Artificial Intelligence (XAI), nlg evaluation | Unverified | 0 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | Nov 25, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation | Mar 29, 2017 | nlg evaluation, Survey | Unverified | 0 |
| The Pitfalls of Defining Hallucination | Jan 15, 2024 | Hallucination, nlg evaluation | Unverified | 0 |
| The use of rating and Likert scales in Natural Language Generation human evaluation tasks: A review and some recommendations | Oct 1, 2019 | nlg evaluation, Text Generation | Unverified | 0 |
| LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models | Jul 15, 2023 | nlg evaluation, Response Generation | Code Available | 0 |