SOTAVerified

NLG Evaluation

Evaluation of text generated by NLG (Natural Language Generation) systems, such as large language models.
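For context on what these papers automate: the simplest form of NLG evaluation scores a system output against a human-written reference. A minimal, self-contained sketch of one such reference-based signal (clipped unigram precision, the building block of BLEU-style metrics; this is an illustrative example, not a metric proposed by any paper listed below):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference,
    with counts clipped as in BLEU's modified n-gram precision."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Clip each token's count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return overlap / len(cand_tokens)

score = unigram_precision("the cat sat on the mat",
                          "a cat sat on a mat")
print(round(score, 3))  # 4 of 6 candidate tokens match -> 0.667
```

Many of the papers below exist precisely because such surface-overlap metrics correlate poorly with human judgments, motivating learned and LLM-based evaluators (e.g. G-Eval, Themis).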

Papers

Showing 1–25 of 71 papers

| Title | Status | Hype |
|---|---|---|
| NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist | Code | 3 |
| Towards a Unified Multi-Dimensional Evaluator for Text Generation | Code | 2 |
| Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability | Code | 1 |
| Leveraging Large Language Models for NLG Evaluation: Advances and Challenges | Code | 1 |
| LUNA: A Framework for Language Understanding and Naturalness Assessment | Code | 1 |
| Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory | Code | 1 |
| G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment | Code | 1 |
| Is ChatGPT a Good NLG Evaluator? A Preliminary Study | Code | 1 |
| Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis | Code | 1 |
| Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons | Code | 1 |
| Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation | Code | 1 |
| Long-Form Information Alignment Evaluation Beyond Atomic Facts | Code | 0 |
| Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts | | 0 |
| DeepSeek vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization? | | 0 |
| OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs | Code | 0 |
| Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators | | 0 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | | 0 |
| Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking | | 0 |
| Analyzing and Evaluating Correlation Measures in NLG Meta-Evaluation | Code | 0 |
| Large Language Models Are Active Critics in NLG Evaluation | | 0 |
| DHP Benchmark: Are LLMs Good NLG Evaluators? | | 0 |
| ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models | Code | 0 |
| Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling | Code | 0 |
| Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation | Code | 0 |
| Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models | Code | 0 |
Page 1 of 3

No leaderboard results yet.