SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 21-30 of 69 papers

Title | Status | Hype
Enterprise Large Language Model Evaluation Benchmark | | 0
Finance Language Model Evaluation (FLaME) | | 0
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | | 0
FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation | Code | 0
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation | | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | | 0
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | | 0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0
Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation | | 0

No leaderboard results yet.