SOTAVerified

Language Model Evaluation

The task of using LLMs as evaluators of large language models and vision-language models.

Papers

Showing 51–69 of 69 papers

| Title | Status | Hype |
| --- | --- | --- |
| Pseudointelligence: A Unifying Framework for Language Model Evaluation | — | 0 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Code | 1 |
| SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research | Code | 1 |
| AgentSims: An Open-Source Sandbox for Large Language Model Evaluation | Code | 2 |
| FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | Code | 2 |
| C-STS: Conditional Semantic Textual Similarity | Code | 1 |
| PrOnto: Language Model Evaluations for 859 Languages | Code | 0 |
| Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | — | 0 |
| Controlling for Stereotypes in Multimodal Language Model Evaluation | — | 0 |
| A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | — | 0 |
| BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | Code | 2 |
| BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | — | 0 |
| Language Model Evaluation in Open-ended Text Generation | — | 0 |
| Language Model Evaluation Beyond Perplexity | — | 0 |
| ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning | Code | 1 |
| Mind the Gap: Assessing Temporal Generalization in Neural Language Models | Code | 0 |
| CLiMP: A Benchmark for Chinese Language Model Evaluation | — | 0 |
| Improving Explainable Recommendations with Synthetic Reviews | — | 0 |
| Contrastive Entropy: A new evaluation metric for unnormalized language models | — | 0 |
Page 3 of 3

No leaderboard results yet.