| Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code | Nov 14, 2023 | Language Model Evaluation, Language Modeling | Code Available | 4 | 5 |
| Evalverse: Unified and Accessible Library for Large Language Model Evaluation | Apr 1, 2024 | Language Model Evaluation, Language Modeling | Code Available | 3 | 5 |
| FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | Jul 20, 2023 | Instruction Following, Language Model Evaluation | Code Available | 2 | 5 |
| C^2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation | Dec 6, 2024 | Language Model Evaluation, Language Modeling | Code Available | 2 | 5 |
| AgentSims: An Open-Source Sandbox for Large Language Model Evaluation | Aug 8, 2023 | Language Model Evaluation, Language Modeling | Code Available | 2 | 5 |
| BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing | Jun 30, 2022 | Diversity, Language Model Evaluation | Code Available | 2 | 5 |
| Catwalk: A Unified Language Model Evaluation Framework for Many Datasets | Dec 15, 2023 | In-Context Learning, Language Model Evaluation | Code Available | 1 | 5 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 | 5 |
| Salmon: A Suite for Acoustic Language Model Evaluation | Sep 11, 2024 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| C-STS: Conditional Semantic Textual Similarity | May 24, 2023 | Information Retrieval, Language Model Evaluation | Code Available | 1 | 5 |
| DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA | Dec 6, 2024 | counterfactual, Language Model Evaluation | Code Available | 1 | 5 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis | Feb 17, 2025 | Aspect-Based Sentiment Analysis, Aspect-Based Sentiment Analysis (ABSA) | Code Available | 1 | 5 |
| MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation | Dec 28, 2023 | GSM8K, Language Model Evaluation | Code Available | 1 | 5 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| Role-Playing Evaluation for Large Language Models | May 19, 2025 | Language Model Evaluation | Code Available | 1 | 5 |
| SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research | Aug 25, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction | Dec 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training | Dec 11, 2024 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning | Feb 25, 2021 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| Mitigating the Bias of Large Language Model Evaluation | Sep 25, 2024 | Instruction Following, Language Model Evaluation | Code Available | 0 | 5 |
| FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation | May 30, 2025 | Diagnostic, Language Model Evaluation | Code Available | 0 | 5 |
| Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation | Jun 20, 2024 | GSM8K, Language Model Evaluation | Code Available | 0 | 5 |
| Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform | Mar 13, 2024 | Language Model Evaluation, Language Modelling | Code Available | 0 | 5 |
| Enterprise Benchmarks for Large Language Model Evaluation | Oct 11, 2024 | Benchmarking, Language Model Evaluation | Code Available | 0 | 5 |
| Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging | May 20, 2024 | Language Model Evaluation, Language Modeling | Code Available | 0 | 5 |
| Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models | Apr 17, 2024 | Form, Language Model Evaluation | Code Available | 0 | 5 |
| Mind the Gap: Assessing Temporal Generalization in Neural Language Models | Feb 3, 2021 | Language Model Evaluation, Language Modeling | Code Available | 0 | 5 |
| Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain | Jan 10, 2025 | Language Model Evaluation, Language Modeling | Code Available | 0 | 5 |
| PrOnto: Language Model Evaluations for 859 Languages | May 22, 2023 | Language Model Evaluation, Language Modeling | Code Available | 0 | 5 |
| Large Language Model Evaluation via Matrix Nuclear-Norm | Oct 14, 2024 | Computational Efficiency, Data Compression | Code Available | 0 | 5 |
| Pseudointelligence: A Unifying Framework for Language Model Evaluation | Oct 18, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation | May 4, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| Rethinking Generative Large Language Model Evaluation for Semantic Comprehension | Mar 12, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Dec 31, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation | Jun 6, 2024 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation | Mar 19, 2025 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| ViDAS: Vision-based Danger Assessment and Scoring | Oct 1, 2024 | Fixed Few Shot Prompting, Fixed Few Shot Prompting Danger Assessment | Unverified | 0 | 0 |
| KMMLU: Measuring Massive Multitask Language Understanding in Korean | Feb 18, 2024 | kmmlu, Language Model Evaluation | Unverified | 0 | 0 |
| Advancing Chinese biomedical text mining with community challenges | Mar 7, 2024 | Attribute, Attribute Extraction | Unverified | 0 | 0 |
| BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models | Jun 2, 2025 | Language Model Evaluation | Unverified | 0 | 0 |
| Benchmarking Harmonized Tariff Schedule Classification Models | Dec 4, 2024 | Benchmarking, Classification | Unverified | 0 | 0 |
| Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks | Jul 29, 2024 | Benchmarking, Language Model Evaluation | Unverified | 0 | 0 |
| BPoMP: The Benchmark of Poetic Minimal Pairs – Limericks, Rhyme, and Narrative Coherence | Sep 1, 2021 | Language Model Evaluation, Language Modelling | Unverified | 0 | 0 |
| Branch-Solve-Merge Improves Large Language Model Evaluation and Generation | Oct 23, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| CLiMP: A Benchmark for Chinese Language Model Evaluation | Jan 26, 2021 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation | Apr 29, 2025 | Code Generation, Language Model Evaluation | Unverified | 0 | 0 |
| Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges | Apr 30, 2025 | Bayesian Inference, Language Model Evaluation | Unverified | 0 | 0 |
| Contrastive Entropy: A new evaluation metric for unnormalized language models | Jan 3, 2016 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |
| Controlling for Stereotypes in Multimodal Language Model Evaluation | Feb 3, 2023 | Language Model Evaluation, Language Modeling | Unverified | 0 | 0 |