SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 51-69 of 69 papers

| Title | Status | Hype |
| --- | --- | --- |
| Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0 |
| CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0 |
| DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0 |
| Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | | 0 |
| Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | | 0 |
| Enterprise Large Language Model Evaluation Benchmark | | 0 |
| Finance Language Model Evaluation (FLaME) | | 0 |
| Generalization Measures for Zero-Shot Cross-Lingual Transfer | | 0 |
| Improving Explainable Recommendations with Synthetic Reviews | | 0 |
| iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization | | 0 |
| Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | | 0 |
| A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | | 0 |
| Language Model Evaluation Beyond Perplexity | | 0 |
| Language Model Evaluation in Open-ended Text Generation | | 0 |
| Lessons from the Trenches on Reproducible Evaluation of Language Models | | 0 |
| LMUnit: Fine-grained Evaluation with Natural Language Unit Tests | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0 |
| On Speeding Up Language Model Evaluation | | 0 |

No leaderboard results yet.