| Constructing Narrative Event Evolutionary Graph for Script Event Prediction | May 14, 2018 | Graph Neural Network, Multiple-choice | Code Available | 1 | 5 |
| Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers | Dec 7, 2023 | Math, Multiple-choice | Code Available | 1 | 5 |
| Logic-Guided Data Augmentation and Regularization for Consistent Question Answering | Apr 21, 2020 | Data Augmentation, Machine Reading Comprehension | Code Available | 1 | 5 |
| LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | May 18, 2025 | Logical Reasoning, Multimodal Reasoning | Code Available | 1 | 5 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choice, Position | Code Available | 1 | 5 |
| IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages | Nov 8, 2020 | Genre classification, Multiple-choice | Code Available | 1 | 5 |
| IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce | Jun 14, 2024 | Multiple-choice, Question Answering | Code Available | 1 | 5 |
| M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models | May 17, 2023 | Instruction Following, Multiple-choice | Code Available | 1 | 5 |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | May 7, 2023 | Multiple-choice | Code Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | Code Available | 1 | 5 |