SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.
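The core pattern behind this task is LLM-as-judge: a judge model is prompted to rate another model's output, and a numeric score is parsed from its reply. A minimal sketch of the prompt-and-parse step (illustrative only; `build_judge_prompt` and `parse_score` are hypothetical helpers, not from any listed paper):

```python
import re

def build_judge_prompt(question: str, answer: str) -> str:
    """Compose a prompt asking a judge LLM to rate an answer on a 1-5 scale."""
    return (
        "You are an impartial judge. Rate the answer on a 1-5 scale.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with 'Score: <n>'."
    )

def parse_score(judge_reply: str) -> int:
    """Extract the integer score from the judge model's free-form reply."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    if not m:
        raise ValueError("no score found in judge reply")
    return int(m.group(1))

# In practice the prompt is sent to a judge LLM; here the reply is
# simulated to show the parsing step.
prompt = build_judge_prompt("What is 2 + 2?", "4")
simulated_reply = "The answer is correct. Score: 5"
print(parse_score(simulated_reply))  # → 5
```

Many of the papers below study failure modes of exactly this loop, such as judge bias and prompt sensitivity.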

Papers

Showing 51–69 of 69 papers

Title | Status | Hype
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation | - | 0
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | - | 0
Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | - | 0
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation | - | 0
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | - | 0
ViDAS: Vision-based Danger Assessment and Scoring | - | 0
Lessons from the Trenches on Reproducible Evaluation of Language Models | - | 0
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests | - | 0
Large Language Model Evaluation via Matrix Nuclear-Norm | Code | 0
PrOnto: Language Model Evaluations for 859 Languages | Code | 0
Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform | Code | 0
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0
Enterprise Benchmarks for Large Language Model Evaluation | Code | 0
Mitigating the Bias of Large Language Model Evaluation | Code | 0
Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain | Code | 0
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging | Code | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Mind the Gap: Assessing Temporal Generalization in Neural Language Models | Code | 0
FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation | Code | 0
Page 2 of 2

No leaderboard results yet.