SOTAVerified

Language Model Evaluation

The task of using LLMs to evaluate large language models and vision-language models.

Papers

Showing 21-30 of 69 papers

Title | Status | Hype
Enterprise Large Language Model Evaluation Benchmark | | 0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | | 0
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | | 0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0
A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | | 0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | | 0
Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0

No leaderboard results yet.