| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 | 5 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Dec 3, 2024 | Multiple-choice | Code Available | 1 | 5 |
| Constructing Narrative Event Evolutionary Graph for Script Event Prediction | May 14, 2018 | Graph Neural Network, Multiple-choice | Code Available | 1 | 5 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 | 5 |
| Leveraging Large Language Models for Multiple Choice Question Answering | Oct 22, 2022 | Answer Selection, Multiple-choice | Code Available | 1 | 5 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| An MRC Framework for Semantic Role Labeling | Sep 14, 2021 | Computational Efficiency, Machine Reading Comprehension | Code Available | 1 | 5 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge Graphs, Multiple-choice | Code Available | 1 | 5 |
| Leaf: Multiple-Choice Question Generation | Jan 22, 2022 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Nov 2, 2018 | Common Sense Reasoning, Multiple-choice | Code Available | 1 | 5 |
| A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | Jun 8, 2024 | Descriptive, Language Modelling | Code Available | 1 | 5 |
| LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts | Jul 6, 2024 | Logical Reasoning, Mathematical Reasoning | Code Available | 1 | 5 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwag, Language Modeling | Code Available | 1 | 5 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | Code Available | 1 | 5 |
| Latxa: An Open Language Model and Evaluation Suite for Basque | Mar 29, 2024 | Language Modeling, Language Modelling | Code Available | 1 | 5 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal Prediction, Multiple-choice | Code Available | 1 | 5 |
| Marathon: A Race Through the Realm of Long Context with Large Language Models | Dec 15, 2023 | Long-Context Understanding, Multiple-choice | Code Available | 1 | 5 |
| CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models | Sep 5, 2023 | Code Generation, Multiple-choice | Code Available | 1 | 5 |
| Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | May 23, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 | 5 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | Math, Multiple-choice | Code Available | 1 | 5 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 | 5 |
| Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom | Apr 30, 2024 | Implicatures, Multiple-choice | Code Available | 1 | 5 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choice, Position | Code Available | 1 | 5 |
| CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training | Jun 15, 2024 | Domain Adaptation, Language Modeling | Code Available | 1 | 5 |
| BiMediX: Bilingual Medical Mixture of Experts LLM | Feb 20, 2024 | Mixture-of-Experts, Multiple-choice | Code Available | 1 | 5 |
| E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Jan 29, 2024 | Ethics, Multiple-choice | Code Available | 1 | 5 |
| Large Language Models Encode Clinical Knowledge | Dec 26, 2022 | Clinical Knowledge, MedQA | Code Available | 1 | 5 |
| Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | May 22, 2025 | Multiple-choice, Visual Question Answering (VQA) | Code Available | 1 | 5 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 | 5 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | Code Available | 1 | 5 |
| LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Jun 4, 2025 | Multiple-choice | Code Available | 1 | 5 |
| JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Oct 16, 2023 | Domain Adaptation, Medical Question Answering | Code Available | 1 | 5 |
| A Few More Examples May Be Worth Billions of Parameters | Oct 8, 2021 | Extractive Question-Answering, Multiple-choice | Code Available | 1 | 5 |
| MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Oct 14, 2024 | Multiple-choice | Code Available | 1 | 5 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor Generation, Language Modelling | Code Available | 1 | 5 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Jan 6, 2025 | Language Model Evaluation, Language Modeling | Code Available | 1 | 5 |
| Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action Recognition, Linear evaluation | Code Available | 1 | 5 |
| Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward | May 3, 2020 | Abstractive Text Summarization, Cloze Test | Code Available | 1 | 5 |
| Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Apr 30, 2022 | Decoder, Multiple-choice | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |
| Multiple Choice Questions based Multi-Interest Policy Learning for Conversational Recommendation | Dec 22, 2021 | Attribute, Conversational Recommendation | Code Available | 1 | 5 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| NarrativeXL: A Large-scale Dataset For Long-Term Memory Models | May 23, 2023 | Multiple-choice, Reading Comprehension | Code Available | 1 | 5 |
| Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Dec 27, 2020 | counterfactual, Multiple-choice | Code Available | 1 | 5 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical Reasoning, Multiple-choice | Code Available | 1 | 5 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies. | Nov 1, 2020 | Distractor Generation, Multiple-choice | Code Available | 1 | 5 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Feb 24, 2025 | Logical Reasoning, Multiple-choice | Code Available | 1 | 5 |