| GPT Takes the Bar Exam | Dec 29, 2022 | Hyperparameter OptimizationMultiple-choice | CodeCode Available | 1 | 5 |
| General-Purpose Question-Answering with Macaw | Sep 6, 2021 | Generative Question AnsweringMultiple-choice | CodeCode Available | 1 | 5 |
| ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Apr 7, 2025 | Chart Question AnsweringChart Understanding | CodeCode Available | 1 | 5 |
| Generating Distractors for Reading Comprehension Questions from Real Examinations | Sep 8, 2018 | DecoderDistractor Generation | CodeCode Available | 1 | 5 |
| CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning | Jan 25, 2024 | Multiple-choicePosition | CodeCode Available | 1 | 5 |
| Fine-tuning Multimodal Large Language Models for Product Bundling | Jul 16, 2024 | In-Context LearningMultiple-choice | CodeCode Available | 1 | 5 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Nov 25, 2024 | AllLong Question Answer | CodeCode Available | 1 | 5 |
| FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture | Jun 16, 2024 | DiversityMultiple-choice | CodeCode Available | 1 | 5 |
| Fool Your (Vision and) Language Model With Embarrassingly Simple Permutations | Oct 2, 2023 | In-Context LearningInstruction Following | CodeCode Available | 1 | 5 |
| FarsTail: A Persian Natural Language Inference Dataset | Sep 18, 2020 | Multiple-choiceNatural Language Inference | CodeCode Available | 1 | 5 |
| Fake Alignment: Are LLMs Really Aligned Well? | Nov 10, 2023 | Multiple-choice | CodeCode Available | 1 | 5 |
| FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue | May 12, 2022 | Dialogue UnderstandingDomain Adaptation | CodeCode Available | 1 | 5 |
| Explaining NLP Models via Minimal Contrastive Editing (MiCE) | Dec 27, 2020 | counterfactualMultiple-choice | CodeCode Available | 1 | 5 |
| CC-Riddle: A Question Answering Dataset of Chinese Character Riddles | Jun 28, 2022 | General KnowledgeLanguage Modelling | CodeCode Available | 1 | 5 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Jan 17, 2025 | FairnessMultiple-choice | CodeCode Available | 1 | 5 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Dec 12, 2024 | HallucinationKnowledge Graph Completion | CodeCode Available | 1 | 5 |
| From Machine Reading Comprehension to Dialogue State Tracking: Bridging the Gap | Apr 13, 2020 | Dialogue State TrackingMachine Reading Comprehension | CodeCode Available | 1 | 5 |
| HCQA @ Ego4D EgoSchema Challenge 2024 | Jun 22, 2024 | Caption Generation | CodeCode Available | 1 | 5 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | CodeCode Available | 1 | 5 |
| AdaLoGN: Adaptive Logic Graph Network for Reasoning-Based Machine Reading Comprehension | Mar 16, 2022 | Logical ReasoningMachine Reading Comprehension | CodeCode Available | 1 | 5 |
| Can large language models reason about medical questions? | Jul 17, 2022 | MedQAMultiple-choice | CodeCode Available | 1 | 5 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | CodeCode Available | 1 | 5 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical ReasoningMultiple-choice | CodeCode Available | 1 | 5 |
| GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | May 16, 2025 | Instruction FollowingMultiple-choice | CodeCode Available | 1 | 5 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model EvaluationLanguage Modeling | CodeCode Available | 1 | 5 |