| ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies | Mar 2, 2024 | Multiple-choice | Code Available | 1 |
| NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism | Feb 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents | Feb 27, 2024 | Document Classification, Language Modeling | Code Available | 1 |
| Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | Feb 26, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| SportQA: A Benchmark for Sports Understanding in Large Language Models | Feb 24, 2024 | Few-Shot Learning, Multiple-choice | Code Available | 1 |
| Uncertainty-Aware Evaluation for Vision-Language Models | Feb 22, 2024 | Conformal Prediction, Language Modeling | Code Available | 1 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Feb 20, 2024 | Mixture-of-Experts, Multiple-choice | Code Available | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 |
| The Effect of Sampling Temperature on Problem Solving in Large Language Models | Feb 7, 2024 | Multiple-choice, Prompt Engineering | Code Available | 1 |
| SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Feb 6, 2024 | Attribute, Face Anti-Spoofing | Code Available | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choice, Position | Code Available | 1 |
| LongHealth: A Question Answering Benchmark with Long Clinical Documents | Jan 25, 2024 | Information Retrieval, Multiple-choice | Code Available | 1 |
| The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models | Jan 11, 2024 | Math, Multiple-choice | Code Available | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Dec 26, 2023 | Diversity, Knowledge Graphs | Code Available | 1 |
| RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models | Dec 26, 2023 | Memorization, Multiple-choice | Code Available | 1 |
| An In-depth Look at Gemini's Language Abilities | Dec 18, 2023 | Instruction Following, Math | Code Available | 1 |
| Marathon: A Race Through the Realm of Long Context with Large Language Models | Dec 15, 2023 | Long-Context Understanding, Multiple-choice | Code Available | 1 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | Math, Multiple-choice | Code Available | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Nov 10, 2023 | Multiple-choice | Code Available | 1 |
| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Nov 10, 2023 | GSM8K, Memorization | Code Available | 1 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Nov 2, 2023 | Density Estimation, Diversity | Code Available | 1 |