Language Model Evaluation

The task of using LLMs to evaluate large language models and vision-language models.

Papers

Showing 21–30 of 69 papers

Title	Date	Tasks	Status	Hype
Enterprise Large Language Model Evaluation Benchmark	Jun 25, 2025	Language Model Evaluation, Language Modeling	Unverified	0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation	Oct 23, 2023	Language Model Evaluation, Language Modeling	Unverified	0
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence	Sep 1, 2021	Language Model Evaluation, Language Modelling	Unverified	0
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs	Apr 22, 2023	Language Model Evaluation, Language Modeling	Unverified	0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation	May 24, 2024	Language Model Evaluation, Language Modeling	Unverified	0
A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation	Jul 1, 2022	Language Model Evaluation, Language Modeling	Unverified	0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain	Feb 11, 2024	Language Model Evaluation, Language Modeling	Unverified	0
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation	Nov 29, 2023	Language Model Evaluation, Language Modeling	Unverified	0
Controlling for Stereotypes in Multimodal Language Model Evaluation	Feb 3, 2023	Language Model Evaluation, Language Modeling	Unverified	0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks	Jul 29, 2024	Benchmarking, Language Model Evaluation	Unverified	0

No leaderboard results yet.