SOTAVerified

Language Model Evaluation

The task of using LLMs to evaluate large language models and vision-language models.

Papers

Showing 51-69 of 69 papers

| Title | Status | Hype |
| --- | --- | --- |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | | 0 |
| Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation | | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | | 0 |
| ViDAS: Vision-based Danger Assessment and Scoring | | 0 |
| Lessons from the Trenches on Reproducible Evaluation of Language Models | | 0 |
| LMUnit: Fine-grained Evaluation with Natural Language Unit Tests | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0 |
| Large Language Model Evaluation via Matrix Nuclear-Norm | Code | 0 |
| FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation | Code | 0 |
| PrOnto: Language Model Evaluations for 859 Languages | Code | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0 |
| Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform | Code | 0 |
| Mitigating the Bias of Large Language Model Evaluation | Code | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Code | 0 |
| Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain | Code | 0 |
| Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0 |
| Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging | Code | 0 |

No leaderboard results yet.