SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 41-50 of 69 papers

Title | Status | Hype
KMMLU: Measuring Massive Multitask Language Understanding in Korean | - | 0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | - | 0
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Code | 1
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Code | 1
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Code | 1
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | - | 0
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Code | 4
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | - | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | - | 0
Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | - | 0

No leaderboard results yet.