SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 31-40 of 69 papers

Title | Status | Hype
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | | 0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0
CLiMP: A Benchmark for Chinese Language Model Evaluation | | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | | 0
Contrastive Entropy: A new evaluation metric for unnormalized language models | | 0
Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0
A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | | 0

No leaderboard results yet.