SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 51–69 of 69 papers

- Advancing Chinese biomedical text mining with community challenges
- KMMLU: Measuring Massive Multitask Language Understanding in Korean
- CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain
- Elo Uncovered: Robustness and Best Practices in Language Model Evaluation
- Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
- MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation
- Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing
- Pseudointelligence: A Unifying Framework for Language Model Evaluation
- PrOnto: Language Model Evaluations for 859 Languages [Code]
- Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs
- Controlling for Stereotypes in Multimodal Language Model Evaluation
- A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation
- BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence
- Language Model Evaluation in Open-ended Text Generation
- Language Model Evaluation Beyond Perplexity
- Mind the Gap: Assessing Temporal Generalization in Neural Language Models [Code]
- CLiMP: A Benchmark for Chinese Language Model Evaluation
- Improving Explainable Recommendations with Synthetic Reviews
- Contrastive Entropy: A new evaluation metric for unnormalized language models
Page 2 of 2

No leaderboard results yet.