
Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers


Showing 51–69 of 69 papers

Title	Date	Tasks	Status	Hype	Score
Controlling for Stereotypes in Multimodal Language Model Evaluation	Feb 3, 2023	Language Model Evaluation, Language Modeling	Unverified	0	0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain	Feb 11, 2024	Language Model Evaluation, Language Modeling	Unverified	0	0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation	May 24, 2024	Language Model Evaluation, Language Modeling	Unverified	0	0
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs	Apr 22, 2023	Language Model Evaluation, Language Modeling	Unverified	0	0
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation	Nov 29, 2023	Language Model Evaluation, Language Modeling	Unverified	0	0
Enterprise Large Language Model Evaluation Benchmark	Jun 25, 2025	Language Model Evaluation, Language Modeling	Unverified	0	0
Finance Language Model Evaluation (FLaME)	Jun 18, 2025	Benchmarking, Language Model Evaluation	Unverified	0	0
Generalization Measures for Zero-Shot Cross-Lingual Transfer	Apr 24, 2024	Cross-Lingual Transfer, Language Model Evaluation	Unverified	0	0
Improving Explainable Recommendations with Synthetic Reviews	Jul 18, 2018	Language Model Evaluation, Language Modeling	Unverified	0	0
iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization	May 24, 2024	Language Model Evaluation, Language Modeling	Unverified	0	0
Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing	Oct 19, 2023	Decoder, Language Model Evaluation	Unverified	0	0
A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation	Jul 1, 2022	Language Model Evaluation, Language Modeling	Unverified	0	0
Language Model Evaluation Beyond Perplexity	May 31, 2021	Language Model Evaluation, Language Modeling	Unverified	0	0
Language Model Evaluation in Open-ended Text Generation	Aug 8, 2021	Attribute, Diversity	Unverified	0	0
Lessons from the Trenches on Reproducible Evaluation of Language Models	May 23, 2024	Language Model Evaluation, Language Modeling	Unverified	0	0
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests	Dec 17, 2024	Language Model Evaluation, Language Modeling	Unverified	0	0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation	Oct 21, 2023	Benchmarking, Language Model Evaluation	Unverified	0	0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation	Mar 13, 2025	Language Model Evaluation, Language Modeling	Unverified	0	0
On Speeding Up Language Model Evaluation	Jul 8, 2024	Language Model Evaluation, Language Modeling	Unverified	0	0


No leaderboard results yet.