| Long-Form Information Alignment Evaluation Beyond Atomic Facts | May 21, 2025 | Form, nlg evaluation | Code Available | 0 |
| Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts | Apr 29, 2025 | All, Diversity | Unverified | 0 |
| DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? | Apr 10, 2025 | Machine Translation, nlg evaluation | Unverified | 0 |
| OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs | Mar 14, 2025 | nlg evaluation | Code Available | 0 |
| Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators | Mar 6, 2025 | nlg evaluation | Unverified | 0 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | Nov 25, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking | Nov 8, 2024 | Fact Checking, nlg evaluation | Unverified | 0 |
| Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation | Oct 22, 2024 | nlg evaluation | Code Available | 0 |
| Large Language Models Are Active Critics in NLG Evaluation | Oct 14, 2024 | nlg evaluation, Prompt Engineering | Unverified | 0 |
| DHP Benchmark: Are LLMs Good NLG Evaluators? | Aug 25, 2024 | Benchmarking, nlg evaluation | Unverified | 0 |
| ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models | Jul 16, 2024 | nlg evaluation, Text Generation | Code Available | 0 |
| Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability | Jun 26, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation | Jun 12, 2024 | nlg evaluation, Text Generation | Code Available | 0 |
| Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling | Jun 12, 2024 | nlg evaluation | Code Available | 0 |
| Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models | May 23, 2024 | nlg evaluation, Text Generation | Code Available | 0 |
| DEBATE: Devil's Advocate-Based Assessment and Text Evaluation | May 16, 2024 | nlg evaluation, Text Generation | Code Available | 0 |
| WaterJudge: Quality-Detection Trade-off when Watermarking Large Language Models | Mar 28, 2024 | nlg evaluation | Unverified | 0 |
| Are LLM-based Evaluators Confusing NLG Quality Criteria? | Feb 19, 2024 | nlg evaluation | Code Available | 0 |
| One Prompt To Rule Them All: LLMs for Opinion Summary Evaluation | Feb 18, 2024 | All, nlg evaluation | Code Available | 0 |
| LLM-based NLG Evaluation: Current Status and Challenges | Feb 2, 2024 | nlg evaluation, Text Generation | Unverified | 0 |
| The Pitfalls of Defining Hallucination | Jan 15, 2024 | Hallucination, nlg evaluation | Unverified | 0 |
| Leveraging Large Language Models for NLG Evaluation: Advances and Challenges | Jan 13, 2024 | nlg evaluation, Specificity | Code Available | 1 |
| LUNA: A Framework for Language Understanding and Naturalness Assessment | Jan 9, 2024 | nlg evaluation, Text Generation | Code Available | 1 |
| CoAScore: Chain-of-Aspects Prompting for NLG Evaluation | Dec 16, 2023 | nlg evaluation, Response Generation | Unverified | 0 |
| X-Eval: Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects | Nov 15, 2023 | Dialogue Generation, Language Modelling | Unverified | 0 |
| Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation | Aug 6, 2023 | Diversity, nlg evaluation | Code Available | 0 |
| LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models | Jul 15, 2023 | nlg evaluation, Response Generation | Code Available | 0 |
| DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering | Jul 13, 2023 | Dialogue Generation, nlg evaluation | Code Available | 0 |
| Rethinking Model Evaluation as Narrowing the Socio-Technical Gap | Jun 1, 2023 | Explainable Artificial Intelligence (XAI), nlg evaluation | Unverified | 0 |
| Dolphin: A Challenging and Diverse Benchmark for Arabic NLG | May 24, 2023 | Dialogue Generation, Diversity | Unverified | 0 |
| Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References | May 24, 2023 | All, Machine Translation | Code Available | 0 |
| Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory | May 24, 2023 | nlg evaluation, Text Generation | Code Available | 1 |
| NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist | May 15, 2023 | Controllable Language Modelling, Dialogue Generation | Code Available | 3 |
| G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment | Mar 29, 2023 | Dialogue Generation, Diversity | Code Available | 1 |
| Is ChatGPT a Good NLG Evaluator? A Preliminary Study | Mar 7, 2023 | nlg evaluation, Story Generation | Code Available | 1 |
| Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions | Mar 7, 2023 | nlg evaluation, Representation Learning | Code Available | 0 |
| CLSE: Corpus of Linguistically Significant Entities | Nov 4, 2022 | nlg evaluation, Text Generation | Code Available | 0 |
| Dialect-robust Evaluation of Generated Text | Nov 2, 2022 | nlg evaluation | Unverified | 0 |
| Towards a Unified Multi-Dimensional Evaluator for Text Generation | Oct 13, 2022 | nlg evaluation, Question Answering | Code Available | 2 |
| Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis | Oct 10, 2022 | All, Image Captioning | Code Available | 1 |
| NLG-Metricverse: An End-to-End Library for Evaluating Natural Language Generation | Oct 1, 2022 | Management, nlg evaluation | Unverified | 0 |
| EffEval: A Comprehensive Evaluation of Efficiency for MT Evaluation Metrics | Sep 20, 2022 | CPU, GPU | Code Available | 0 |
| A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation – through the Lens of Semantic Similarity Rating | Jul 1, 2022 | nlg evaluation, Semantic Similarity | Unverified | 0 |
| A Dynamic, Interpreted CheckList for Meaning-oriented NLG Metric Evaluation -- through the Lens of Semantic Similarity Rating | May 24, 2022 | nlg evaluation, Semantic Similarity | Unverified | 0 |
| The Authenticity Gap in Human Evaluation | May 24, 2022 | nlg evaluation, Single Particle Analysis | Unverified | 0 |
| Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications | May 13, 2022 | nlg evaluation, Text Generation | Unverified | 0 |
| Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets | May 13, 2022 | nlg evaluation, Question Answering | Code Available | 0 |
| Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation and Understanding | Apr 16, 2022 | Cross-Lingual Natural Language Inference, Natural Language Inference | Code Available | 0 |
| Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons | Mar 11, 2022 | nlg evaluation | Code Available | 1 |
| Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text | Feb 14, 2022 | nlg evaluation, Text Generation | Unverified | 0 |