| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Dec 12, 2024 | Hallucination, Knowledge Graph Completion | Code Available | 1 |
| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Nov 10, 2023 | GSM8K, Memorization | Code Available | 1 |
| ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic | Feb 20, 2024 | ArabicMMLU, Language Model Evaluation | Code Available | 1 |
| OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models | Oct 11, 2023 | Hallucination, In-Context Learning | Code Available | 1 |
| Option Tracing: Beyond Correctness Analysis in Knowledge Tracing | Apr 19, 2021 | Knowledge Tracing, Multiple-choice | Code Available | 1 |
| ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Jul 8, 2024 | Anomaly Detection, Code Generation | Code Available | 1 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | Code Available | 1 |
| A Few More Examples May Be Worth Billions of Parameters | Oct 8, 2021 | Extractive Question-Answering, Multiple-choice | Code Available | 1 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| CC-Riddle: A Question Answering Dataset of Chinese Character Riddles | Jun 28, 2022 | General Knowledge, Language Modelling | Code Available | 1 |
| Fake Alignment: Are LLMs Really Aligned Well? | Nov 10, 2023 | Multiple-choice | Code Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical Reasoning, Multiple-choice | Code Available | 1 |
| R2DE: a NLP approach to estimating IRT parameters of newly generated questions | Jan 21, 2020 | Multiple-choice, Question Generation | Code Available | 1 |
| Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis | Nov 2, 2023 | Density Estimation, Diversity | Code Available | 1 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Nov 1, 2020 | Distractor Generation, Multiple-choice | Code Available | 1 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Feb 24, 2025 | Logical Reasoning, Multiple-choice | Code Available | 1 |
| Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Dec 27, 2020 | counterfactual, Multiple-choice | Code Available | 1 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | May 14, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | Fairness, Multiple-choice | Code Available | 1 |
| Evaluating language models as risk scores | Jul 19, 2024 | Multiple-choice, Question Answering | Code Available | 1 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | Code Available | 1 |
| SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | Feb 6, 2024 | Attribute, Face Anti-Spoofing | Code Available | 1 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense Reasoning, Multiple-choice | Code Available | 1 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | Disentanglement, Knowledge Tracing | Code Available | 1 |
| Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Jul 24, 2023 | Contrastive Learning, Multimodal Reasoning | Code Available | 1 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Evaluating the Knowledge Dependency of Questions | Nov 21, 2022 | Multiple-choice | Code Available | 1 |
| Taming Overconfidence in LLMs: Reward Calibration in RLHF | Oct 13, 2024 | Multiple-choice | Code Available | 1 |
| Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Apr 30, 2022 | Decoder, Multiple-choice | Code Available | 1 |
| EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Oct 12, 2022 | Distractor Generation, Multiple-choice | Code Available | 1 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | Ethics, Multiple-choice | Code Available | 1 |
| HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses | Dec 26, 2023 | Diversity, Knowledge Graphs | Code Available | 1 |
| TIMEDIAL: Temporal Commonsense Reasoning in Dialog | Jun 8, 2021 | Multiple-choice, Timedial | Code Available | 1 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Sep 5, 2023 | Code Generation, Multiple-choice | Code Available | 1 |
| CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Mar 8, 2025 | Multiple-choice | Code Available | 1 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Jan 15, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Apr 30, 2024 | Implicatures, Multiple-choice | Code Available | 1 |
| TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering | Aug 27, 2024 | Multiple-choice, Protein Folding | Code Available | 1 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Jun 15, 2024 | Domain Adaptation, Language Modeling | Code Available | 1 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Oct 12, 2020 | Distractor Generation, Multiple-choice | Code Available | 1 |
| TSQA: Tabular Scenario Based Question Answering | Jan 14, 2021 | Machine Reading Comprehension, Multiple-choice | Code Available | 1 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Feb 4, 2025 | Autonomous Driving, Multiple-choice | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models | Jul 15, 2024 | Backdoor Attack, Multiple-choice | Code Available | 1 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | Code Available | 1 |
| Unsupervised Commonsense Question Answering with Self-Talk | Apr 11, 2020 | Language Modeling, Language Modelling | Code Available | 1 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal Prediction, Multiple-choice | Code Available | 1 |
| ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Apr 15, 2021 | Graph Generation, Multiple-choice | Code Available | 1 |
| IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | May 16, 2025 | Multiple-choice | Code Available | 1 |