| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | CodeCode Available | 1 | 5 |
| Logic-Guided Data Augmentation and Regularization for Consistent Question Answering | Apr 21, 2020 | Data AugmentationMachine Reading Comprehension | CodeCode Available | 1 | 5 |
| Enhancing Knowledge Tracing with Concept Map and Response Disentanglement | Aug 23, 2024 | DisentanglementKnowledge Tracing | CodeCode Available | 1 | 5 |
| Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Apr 30, 2022 | DecoderMultiple-choice | CodeCode Available | 1 | 5 |
| CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Mar 8, 2025 | Multiple-choice | CodeCode Available | 1 | 5 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choicePosition | CodeCode Available | 1 | 5 |
| JMedLoRA:Medical Domain Adaptation on Japanese Large Language Models using Instruction-tuning | Oct 16, 2023 | Domain AdaptationMedical Question Answering | CodeCode Available | 1 | 5 |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Oct 2, 2023 | In-Context LearningInstruction Following | CodeCode Available | 1 | 5 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | BenchmarkingEarth Observation | CodeCode Available | 1 | 5 |
| Training Trajectories of Language Models Across Scales | Dec 19, 2022 | In-Context LearningMultiple-choice | CodeCode Available | 1 | 5 |
| Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning | May 22, 2024 | Mathematical ReasoningMultiple-choice | CodeCode Available | 1 | 5 |
| Knowledge Graph-Augmented Abstractive Summarization with Semantic-Driven Cloze Reward | May 3, 2020 | Abstractive Text SummarizationCloze Test | CodeCode Available | 1 | 5 |
| TSQA: Tabular Scenario Based Question Answering | Jan 14, 2021 | Machine Reading ComprehensionMultiple-choice | CodeCode Available | 1 | 5 |
| Counterfactual Variable Control for Robust and Interpretable Question Answering | Oct 12, 2020 | Causal Inferencecounterfactual | CodeCode Available | 1 | 5 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Oct 12, 2020 | Distractor GenerationMultiple-choice | CodeCode Available | 1 | 5 |
| Language Model Uncertainty Quantification with Attention Chain | Mar 24, 2025 | Computational EfficiencyLanguage Modeling | CodeCode Available | 1 | 5 |
| Large Language Models Encode Clinical Knowledge | Dec 26, 2022 | Clinical KnowledgeMedQA | CodeCode Available | 1 | 5 |
| CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Nov 2, 2018 | Common Sense ReasoningMultiple-choice | CodeCode Available | 1 | 5 |
| LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Aug 20, 2023 | Multiple-choiceQuestion Answering | CodeCode Available | 1 | 5 |
| Complex Reasoning over Logical Queries on Commonsense Knowledge Graphs | Mar 12, 2024 | Knowledge GraphsMultiple-choice | CodeCode Available | 1 | 5 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | CodeCode Available | 1 | 5 |
| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | May 18, 2025 | Logical ReasoningMultimodal Reasoning | CodeCode Available | 1 | 5 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | BenchmarkingInstruction Following | CodeCode Available | 1 | 5 |
| Constructing Narrative Event Evolutionary Graph for Script Event Prediction | May 14, 2018 | Graph Neural NetworkMultiple-choice | CodeCode Available | 1 | 5 |
| Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Mar 26, 2025 | MMLUMultiple-choice | CodeCode Available | 1 | 5 |