SOTAVerified

Language Model Evaluation

The task of using LLMs to evaluate large language models and vision-language models.

Papers

Showing 41-50 of 69 papers

Title | Status | Hype
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | | 0
Benchmarking Harmonized Tariff Schedule Classification Models | | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | | 0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0
CLiMP: A Benchmark for Chinese Language Model Evaluation | | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | | 0
Contrastive Entropy: A new evaluation metric for unnormalized language models | | 0
Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0

No leaderboard results yet.