SOTAVerified

Language Model Evaluation

The task of using LLMs to evaluate large language models and vision-language models.

Papers

Showing 26-50 of 69 papers

Title | Status | Hype
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging | Code | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Mind the Gap: Assessing Temporal Generalization in Neural Language Models | Code | 0
Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain | Code | 0
PrOnto: Language Model Evaluations for 859 Languages | Code | 0
Large Language Model Evaluation via Matrix Nuclear-Norm | Code | 0
Pseudointelligence: A Unifying Framework for Language Model Evaluation | | 0
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation | | 0
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | | 0
Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | | 0
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation | | 0
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | | 0
ViDAS: Vision-based Danger Assessment and Scoring | | 0
KMMLU: Measuring Massive Multitask Language Understanding in Korean | | 0
Advancing Chinese biomedical text mining with community challenges | | 0
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | | 0
Benchmarking Harmonized Tariff Schedule Classification Models | | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | | 0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0
CLiMP: A Benchmark for Chinese Language Model Evaluation | | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | | 0
Contrastive Entropy: A new evaluation metric for unnormalized language models | | 0
Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0

No leaderboard results yet.