| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Dec 1, 2024 | ARC, Multiple-choice | —Unverified | 0 |
| Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering | Aug 6, 2020 | Multiple-choice, Question Answering | —Unverified | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | Jun 8, 2024 | Abstractive Text Summarization, Dialogue Generation | —Unverified | 0 |
| Investigating Data Contamination in Modern Benchmarks for Large Language Models | Nov 16, 2023 | Common Sense Reasoning, MMLU | —Unverified | 0 |
| Self-Assessment Tests are Unreliable Measures of LLM Personality | Sep 15, 2023 | Multiple-choice | —Unverified | 0 |
| Investigating the Effectiveness of ChatGPT in Mathematical Reasoning and Problem Solving: Evidence from the Vietnamese National High School Graduation Examination | Jun 10, 2023 | Math, Mathematical Reasoning | —Unverified | 0 |
| Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting | Oct 18, 2023 | Multiple-choice | —Unverified | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Jun 18, 2025 | document understanding, Multiple-choice | —Unverified | 0 |
| ISAAQ -- Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | Oct 1, 2020 | Multiple-choice, Question Answering | —Unverified | 0 |
| ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention | Nov 1, 2020 | Multiple-choice, Question Answering | —Unverified | 0 |