SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 26–50 of 69 papers

Title | Status | Hype
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | | 0
Advancing Chinese biomedical text mining with community challenges | | 0
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | | 0
Benchmarking Harmonized Tariff Schedule Classification Models | | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | | 0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0
CLiMP: A Benchmark for Chinese Language Model Evaluation | | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | | 0
Contrastive Entropy: A new evaluation metric for unnormalized language models | | 0
Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0
A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | | 0
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | | 0
Enterprise Large Language Model Evaluation Benchmark | | 0
Finance Language Model Evaluation (FLaME) | | 0
Generalization Measures for Zero-Shot Cross-Lingual Transfer | | 0
Improving Explainable Recommendations with Synthetic Reviews | | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0
On Speeding Up Language Model Evaluation | | 0
Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation | | 0
Pseudointelligence: A Unifying Framework for Language Model Evaluation | | 0

No leaderboard results yet.