| FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models | Apr 29, 2024 | Common Sense ReasoningMultiple-choice | —Unverified | 0 |
| Framing QA as Building and Ranking Intersentence Answer Justifications | Jun 1, 2017 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | Apr 4, 2025 | Multiple-choice | —Unverified | 0 |
| From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project | Sep 4, 2019 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | BenchmarkingMultiple-choice | —Unverified | 0 |
| SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | Nov 12, 2024 | General KnowledgeHallucination | —Unverified | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | Feb 20, 2025 | Multiple-choice | —Unverified | 0 |
| FusionMind -- Improving question and answering with external context fusion | Dec 31, 2023 | Knowledge GraphsMultiple-choice | —Unverified | 0 |
| GANDALF: a General Character Name Description Dataset for Long Fiction | Nov 1, 2021 | Multiple-choiceQuestion Answering | —Unverified | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question AnsweringMultiple-choice | —Unverified | 0 |