SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 51-69 of 69 papers

| Title | Status | Hype |
| --- | --- | --- |
| CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0 |
| DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0 |
| Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | | 0 |
| Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | | 0 |
| Enterprise Large Language Model Evaluation Benchmark | | 0 |
| Finance Language Model Evaluation (FLaME) | | 0 |
| Generalization Measures for Zero-Shot Cross-Lingual Transfer | | 0 |
| Improving Explainable Recommendations with Synthetic Reviews | | 0 |
| iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization | | 0 |
| Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | | 0 |
| A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | | 0 |
| Language Model Evaluation Beyond Perplexity | | 0 |
| Language Model Evaluation in Open-ended Text Generation | | 0 |
| Lessons from the Trenches on Reproducible Evaluation of Language Models | | 0 |
| LMUnit: Fine-grained Evaluation with Natural Language Unit Tests | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0 |
| On Speeding Up Language Model Evaluation | | 0 |
| Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation | | 0 |

No leaderboard results yet.