| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | Jun 13, 2025 | Medical Question Answering, MedQA | —Unverified | 0 | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | Feb 19, 2025 | Instruction Following, Multiple-choice | —Unverified | 0 | 0 |
| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | Dec 1, 2024 | ARC, Multiple-choice | —Unverified | 0 | 0 |
| Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering | Aug 6, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation | Jun 8, 2024 | Abstractive Text Summarization, Dialogue Generation | —Unverified | 0 | 0 |
| Investigating Data Contamination in Modern Benchmarks for Large Language Models | Nov 16, 2023 | Common Sense Reasoning, MMLU | —Unverified | 0 | 0 |
| Self-Assessment Tests are Unreliable Measures of LLM Personality | Sep 15, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| Investigating the Effectiveness of ChatGPT in Mathematical Reasoning and Problem Solving: Evidence from the Vietnamese National High School Graduation Examination | Jun 10, 2023 | Math, Mathematical Reasoning | —Unverified | 0 | 0 |
| Investigating Uncertainty Calibration of Aligned Language Models under the Multiple-Choice Setting | Oct 18, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Jun 18, 2025 | document understanding, Multiple-choice | —Unverified | 0 | 0 |