| Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Nov 14, 2023 | Language Model Evaluation, Language Modeling | Code Available | 4 |
| Evalverse: Unified and Accessible Library for Large Language Model Evaluation | Apr 1, 2024 | Language Model Evaluation, Language Modeling | Code Available | 3 |
| C^2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation | Dec 6, 2024 | Language Model Evaluation, Language Modeling | Code Available | 2 |
| FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | Jul 20, 2023 | Instruction Following, Language Model Evaluation | Code Available | 2 |
| BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | Jun 30, 2022 | Diversity, Language Model Evaluation | Code Available | 2 |
| AgentSims: An Open-Source Sandbox for Large Language Model Evaluation | Aug 8, 2023 | Language Model Evaluation, Language Modeling | Code Available | 2 |
| SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research | Aug 25, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Role-Playing Evaluation for Large Language Models | May 19, 2025 | Language Model Evaluation | Code Available | 1 |
| LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Dec 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 |
| M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis | Feb 17, 2025 | Aspect-Based Sentiment Analysis, Aspect-Based Sentiment Analysis (ABSA) | Code Available | 1 |
| Salmon: A Suite for Acoustic Language Model Evaluation | Sep 11, 2024 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| C-STS: Conditional Semantic Textual Similarity | May 24, 2023 | Information Retrieval, Language Model Evaluation | Code Available | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA | Dec 6, 2024 | counterfactual, Language Model Evaluation | Code Available | 1 |
| Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Dec 15, 2023 | In-Context Learning, Language Model Evaluation | Code Available | 1 |
| ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning | Feb 25, 2021 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Dec 28, 2023 | GSM8K, Language Model Evaluation | Code Available | 1 |
| Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training | Dec 11, 2024 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| iREPO: implicit Reward Pairwise Difference based Empirical Preference Optimization | May 24, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing | Oct 19, 2023 | Decoder, Language Model Evaluation | Unverified | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | kmmlu, Language Model Evaluation | Unverified | 0 |
| Language Model Evaluation Beyond Perplexity | May 31, 2021 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Language Model Evaluation in Open-ended Text Generation | Aug 8, 2021 | Attribute, Diversity | Unverified | 0 |
| Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs | Apr 22, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Advancing Chinese biomedical text mining with community challenges | Mar 7, 2024 | Attribute, Attribute Extraction | Unverified | 0 |
| BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | Jun 2, 2025 | Language Model Evaluation | Unverified | 0 |
| Benchmarking Harmonized Tariff Schedule Classification Models | Dec 4, 2024 | Benchmarking, Classification | Unverified | 0 |
| Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | Jul 29, 2024 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | Sep 1, 2021 | Language Model Evaluation, Language Modelling | Unverified | 0 |
| Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | Oct 23, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| CLiMP: A Benchmark for Chinese Language Model Evaluation | Jan 26, 2021 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | Apr 29, 2025 | Code Generation, Language Model Evaluation | Unverified | 0 |
| Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | Apr 30, 2025 | Bayesian Inference, Language Model Evaluation | Unverified | 0 |
| Contrastive Entropy: A new evaluation metric for unnormalized language models | Jan 3, 2016 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Controlling for Stereotypes in Multimodal Language Model Evaluation | Feb 3, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| CPSDBench: A Large Language Model Evaluation Benchmark and Baseline for Chinese Public Security Domain | Feb 11, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation | May 24, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| A Dog Is Passing Over The Jet? A Text-Generation Dataset for Korean Commonsense Reasoning and Evaluation | Jul 1, 2022 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | Nov 29, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Enterprise Large Language Model Evaluation Benchmark | Jun 25, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Finance Language Model Evaluation (FLaME) | Jun 18, 2025 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| Generalization Measures for Zero-Shot Cross-Lingual Transfer | Apr 24, 2024 | Cross-Lingual Transfer, Language Model Evaluation | Unverified | 0 |
| Improving Explainable Recommendations with Synthetic Reviews | Jul 18, 2018 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation | Mar 13, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| On Speeding Up Language Model Evaluation | Jul 8, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 |
| Predicting Liquidity-Aware Bond Yields using Causal GANs and Deep Reinforcement Learning with LLM Evaluation | Feb 24, 2025 | Decision Making, Deep Reinforcement Learning | Unverified | 0 |
| Pseudointelligence: A Unifying Framework for Language Model Evaluation | Oct 18, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 |