| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Nov 20, 2023 | Multiple-choice | Code Available | 2 | 5 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Jul 31, 2023 | Multiple-choice, Question Answering | Code Available | 2 | 5 |
| BRAINTEASER: Lateral Thinking Puzzles for Large Language Models | Oct 8, 2023 | Distractor Generation, Language Modelling | Code Available | 1 | 5 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models | Oct 5, 2023 | Common Sense Reasoning, Multiple-choice | Code Available | 1 | 5 |
| BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages | Jun 14, 2024 | Multiple-choice | Code Available | 1 | 5 |
| A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Oct 12, 2020 | Distractor Generation, Multiple-choice | Code Available | 1 | 5 |
| Evaluating the Knowledge Dependency of Questions | Nov 21, 2022 | Multiple-choice | Code Available | 1 | 5 |
| Bridging Video-text Retrieval with Multiple Choice Questions | Jan 13, 2022 | Action Recognition, Linear evaluation | Code Available | 1 | 5 |
| Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams | Mar 29, 2023 | Multiple-choice | Code Available | 1 | 5 |