SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 1-50 of 69 papers

Title | Status | Hype
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Code | 4
Evalverse: Unified and Accessible Library for Large Language Model Evaluation | Code | 3
C^2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation | Code | 2
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | Code | 2
BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | Code | 2
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation | Code | 2
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research | Code | 1
Role-Playing Evaluation for Large Language Models | Code | 1
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis | Code | 1
Salmon: A Suite for Acoustic Language Model Evaluation | Code | 1
C-STS: Conditional Semantic Textual Similarity | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA | Code | 1
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Code | 1
ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning | Code | 1
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Code | 1
Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization | | 0
Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | | 0
KMMLU: Measuring Massive Multitask Language Understanding in Korean | | 0
Language Model Evaluation Beyond Perplexity | | 0
Language Model Evaluation in Open-ended Text Generation | | 0
Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | | 0
Advancing Chinese biomedical text mining with community challenges | | 0
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | | 0
Benchmarking Harmonized Tariff Schedule Classification Models | | 0
Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | | 0
BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | | 0
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | | 0
CLiMP: A Benchmark for Chinese Language Model Evaluation | | 0
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | | 0
Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | | 0
Contrastive Entropy: A new evaluation metric for unnormalized language models | | 0
Controlling for Stereotypes in Multimodal Language Model Evaluation | | 0
CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | | 0
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | | 0
A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | | 0
Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | | 0
Enterprise Large Language Model Evaluation Benchmark | | 0
Finance Language Model Evaluation (FLaME) | | 0
Generalization Measures for Zero-Shot Cross-Lingual Transfer | | 0
Improving Explainable Recommendations with Synthetic Reviews | | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | | 0
On Speeding Up Language Model Evaluation | | 0
Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation | | 0
Pseudointelligence: A Unifying Framework for Language Model Evaluation | | 0
Page 1 of 2

No leaderboard results yet.