SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 11–20 of 69 papers

Title | Status | Hype
M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis | Code | 1
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA | Code | 1
Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Code | 1
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Code | 1
Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Code | 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Code | 1
C-STS: Conditional Semantic Textual Similarity | Code | 1
Role-Playing Evaluation for Large Language Models | Code | 1
ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning | Code | 1
Page 2 of 7

No leaderboard results yet.