SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of other large language models and vision-language models.
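Many of the papers below (for example, UPME and "Mitigating the Bias of Large Language Model Evaluation") work in this LLM-as-a-judge setting, where one model scores or ranks the outputs of others. The sketch below shows the core pairwise-judging loop under stated assumptions: `query_llm` is a hypothetical placeholder for whatever chat-completion client you use, and the prompt wording is illustrative, not taken from any listed paper.

```python
# Minimal LLM-as-a-judge sketch. `query_llm` is a hypothetical stand-in for a
# real chat-completion client; the prompt and all names are illustrative only.
import re

JUDGE_PROMPT = """You are an impartial judge. Compare two answers to the same question.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Reply with exactly one token: A, B, or TIE."""


def query_llm(prompt: str) -> str:
    """Placeholder: replace with a call to an actual judge model."""
    return "TIE"  # canned reply so the sketch runs end to end


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge for a pairwise preference and parse its one-token verdict."""
    reply = query_llm(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    match = re.search(r"\b(A|B|TIE)\b", reply)
    return match.group(1) if match else "TIE"  # unparseable reply counts as a tie


def judge_pair_debiased(question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings to counter position bias; disagreement becomes TIE."""
    first = judge_pair(question, answer_a, answer_b)
    second = {"A": "B", "B": "A", "TIE": "TIE"}[judge_pair(question, answer_b, answer_a)]
    return first if first == second else "TIE"


if __name__ == "__main__":
    print(judge_pair_debiased("What is 2 + 2?", "4", "5"))
```

Running both answer orderings is a common mitigation for the judge's position bias, one of the evaluator biases examined by papers in this list.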

Papers

Showing 1–50 of 69 papers

Title | Status | Hype
Enterprise Large Language Model Evaluation Benchmark | — | 0
Finance Language Model Evaluation (FLaME) | — | 0
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | — | 0
FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation | Code | 0
Role-Playing Evaluation for Large Language Models | Code | 1
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation | — | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | — | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | — | 0
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | — | 0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | — | 0
Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation | — | 0
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis | Code | 1
Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain | Code | 0
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | — | 0
LMUnit: Fine-grained Evaluation with Natural Language Unit Tests | — | 0
Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training | Code | 1
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA | Code | 1
C^2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation | Code | 2
Benchmarking Harmonized Tariff Schedule Classification Models | — | 0
Large Language Model Evaluation via Matrix Nuclear-Norm | Code | 0
Enterprise Benchmarks for Large Language Model Evaluation | Code | 0
ViDAS: Vision-based Danger Assessment and Scoring | — | 0
Mitigating the Bias of Large Language Model Evaluation | Code | 0
Salmon: A Suite for Acoustic Language Model Evaluation | Code | 1
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | — | 0
On Speeding Up Language Model Evaluation | — | 0
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Code | 0
Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation | — | 0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | — | 0
iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization | — | 0
Lessons from the Trenches on Reproducible Evaluation of Language Models | — | 0
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging | Code | 0
Generalization Measures for Zero-Shot Cross-Lingual Transfer | — | 0
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Code | 0
Evalverse: Unified and Accessible Library for Large Language Model Evaluation | Code | 3
Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform | Code | 0
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | — | 0
Advancing Chinese biomedical text mining with community challenges | — | 0
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
KMMLU: Measuring Massive Multitask Language Understanding in Korean | — | 0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | — | 0
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Code | 1
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Code | 1
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Code | 1
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | — | 0
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Code | 4
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | — | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | — | 0
Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | — | 0

Leaderboard

No leaderboard results yet.