| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choicePosition | CodeCode Available | 1 | 5 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Dec 3, 2024 | Multiple-choice | CodeCode Available | 1 | 5 |
| Clues Before Answers: Generation-Enhanced Multiple-Choice QA | Apr 30, 2022 | DecoderMultiple-choice | CodeCode Available | 1 | 5 |
| IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages | Nov 8, 2020 | Genre classificationMultiple-choice | CodeCode Available | 1 | 5 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | BenchmarkingClassification | CodeCode Available | 1 | 5 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| An MRC Framework for Semantic Role Labeling | Sep 14, 2021 | Computational EfficiencyMachine Reading Comprehension | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choiceVideo Understanding | CodeCode Available | 1 | 5 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choiceQuestion Answering | CodeCode Available | 1 | 5 |
| Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing | Jul 22, 2024 | AllDiversity | CodeCode Available | 1 | 5 |
| Conformal Prediction with Large Language Models for Multi-Choice Question Answering | May 28, 2023 | Conformal PredictionMultiple-choice | CodeCode Available | 1 | 5 |
| Annealed Winner-Takes-All for Motion Forecasting | Sep 17, 2024 | AllAutonomous Driving | CodeCode Available | 1 | 5 |
| Counterfactual Variable Control for Robust and Interpretable Question Answering | Oct 12, 2020 | Causal Inferencecounterfactual | CodeCode Available | 1 | 5 |
| An Open Source Data Contamination Report for Large Language Models | Oct 26, 2023 | HellaSwagLanguage Modeling | CodeCode Available | 1 | 5 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | BenchmarkingEarth Observation | CodeCode Available | 1 | 5 |
| Ranked Voting based Self-Consistency of Large Language Models | May 16, 2025 | Multiple-choiceOpen-Ended Question Answering | CodeCode Available | 1 | 5 |
| CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Mar 8, 2025 | Multiple-choice | CodeCode Available | 1 | 5 |
| Large Language Models Encode Clinical Knowledge | Dec 26, 2022 | Clinical KnowledgeMedQA | CodeCode Available | 1 | 5 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Mar 20, 2025 | Multiple-choiceVideo Understanding | CodeCode Available | 1 | 5 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense ReasoningIn-Context Learning | CodeCode Available | 1 | 5 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | MathMultiple-choice | CodeCode Available | 1 | 5 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | CodeCode Available | 1 | 5 |
| Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | May 22, 2025 | Multiple-choiceVisual Question Answering (VQA) | CodeCode Available | 1 | 5 |
| ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Apr 7, 2025 | Chart Question AnsweringChart Understanding | CodeCode Available | 1 | 5 |