| Latxa: An Open Language Model and Evaluation Suite for Basque | Mar 29, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | Code Available | 1 |
| Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | May 22, 2025 | Multiple-choice, Visual Question Answering (VQA) | Code Available | 1 |
| Leveraging Large Language Models for Learning Complex Legal Concepts through Storytelling | Feb 26, 2024 | Multiple-choice | Code Available | 1 |
| LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models | Aug 20, 2023 | Multiple-choice, Question Answering | Code Available | 1 |
| LifeQA: A Real-life Dataset for Video Question Answering | May 1, 2020 | Multiple-choice, Question Answering | Code Available | 1 |
| EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain | Oct 12, 2022 | Distractor Generation, Multiple-choice | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | Code Available | 1 |
| Delving into the Reversal Curse: How Far Can Large Language Models Generalize? | Oct 24, 2024 | Multiple-choice | Code Available | 1 |