SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 26-50 of 69 papers

| Title | Status | Hype |
| --- | --- | --- |
| Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0 |
| On Speeding Up Language Model Evaluation | | 0 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0 |
| Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation | | 0 |
| DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0 |
| iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization | | 0 |
| Lessons from the Trenches on Reproducible Evaluation of Language Models | | 0 |
| Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging | Code | 0 |
| Generalization Measures for Zero-Shot Cross-Lingual Transfer | | 0 |
| Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0 |
| Evalverse: Unified and Accessible Library for Large Language Model Evaluation | Code | 3 |
| Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform | Code | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | | 0 |
| Advancing Chinese biomedical text mining with community challenges | | 0 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | | 0 |
| CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0 |
| MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Code | 1 |
| LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Code | 1 |
| Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Code | 1 |
| Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | | 0 |
| Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Code | 4 |
| Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | | 0 |

No leaderboard results yet.