Language Model Evaluation

The task of using LLMs as automated evaluators (judges) of large language models and vision-language models.
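To make the task concrete, a minimal LLM-as-a-judge loop looks roughly like the sketch below. The `query_judge` helper and the 1–5 rubric are hypothetical placeholders invented for illustration (nothing here is taken from the papers listed); in practice the helper would call a real judge model through its API.

```python
# Minimal sketch of LLM-as-a-judge evaluation (illustrative only).
# `query_judge` is a hypothetical stand-in for a call to a real judge model.

JUDGE_PROMPT = """You are grading a model's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent).
Reply with the number only."""

def query_judge(prompt: str) -> str:
    # Placeholder: route `prompt` to an actual judge LLM here.
    return "4"

def judge_answer(question: str, answer: str) -> int:
    """Ask the judge model to score one (question, answer) pair."""
    reply = query_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        score = int(reply.strip())
    except ValueError:
        score = 1  # treat unparsable judge output as the lowest score
    return max(1, min(5, score))  # clamp to the rubric's 1-5 range

if __name__ == "__main__":
    pairs = [("What is 2 + 2?", "4"), ("Name a prime number.", "9")]
    scores = [judge_answer(q, a) for q, a in pairs]
    print(f"mean judge score: {sum(scores) / len(scores):.2f}")
```

Most of the benchmarks and libraries below elaborate on this basic pattern: fixed rubrics, contamination checks, domain-specific test sets, or fine-grained skill decompositions.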

Papers

Showing 1–25 of 69 papers

Title | Status | Hype
Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Code | 4
Evalverse: Unified and Accessible Library for Large Language Model Evaluation | Code | 3
C^2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation | Code | 2
AgentSims: An Open-Source Sandbox for Large Language Model Evaluation | Code | 2
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | Code | 2
BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | Code | 2
Role-Playing Evaluation for Large Language Models | Code | 1
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training | Code | 1
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA | Code | 1
Salmon: A Suite for Acoustic Language Model Evaluation | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Code | 1
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Code | 1
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research | Code | 1
C-STS: Conditional Semantic Textual Similarity | Code | 1
ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning | Code | 1
Enterprise Large Language Model Evaluation Benchmark | — | 0
Finance Language Model Evaluation (FLaME) | — | 0
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | — | 0
FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation | Code | 0
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation | — | 0

Leaderboard

No leaderboard results yet.