| Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI's gpt2 Transformer Model | Aug 23, 2019 | Articles, Language Modeling | —Unverified | 0 | 0 |
| ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data | May 2, 2020 | Knowledge Graphs, Language Modelling | —Unverified | 0 | 0 |
| FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models | Apr 29, 2024 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Framing QA as Building and Ranking Intersentence Answer Justifications | Jun 1, 2017 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | Apr 4, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project | Sep 4, 2019 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | Nov 12, 2024 | General Knowledge, Hallucination | —Unverified | 0 | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | Feb 20, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| FusionMind -- Improving question and answering with external context fusion | Dec 31, 2023 | Knowledge Graphs, Multiple-choice | —Unverified | 0 | 0 |
| GANDALF: a General Character Name Description Dataset for Long Fiction | Nov 1, 2021 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | Nov 25, 2024 | Medical Visual Question Answering, Multiple-choice | —Unverified | 0 | 0 |
| Generalised Winograd Schema and its Contextuality | Aug 31, 2023 | coreference-resolution, Coreference Resolution | —Unverified | 0 | 0 |
| Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data | Jul 20, 2024 | Language Modelling, Machine Translation | —Unverified | 0 | 0 |
| Who did What: A Large-Scale Person-Centered Cloze Dataset | Aug 19, 2016 | Articles, Multiple-choice | —Unverified | 0 | 0 |
| Generating Adequate Distractors for Multiple-Choice Questions | Oct 23, 2020 | Form, Multiple-choice | —Unverified | 0 | 0 |
| Generating Correct Answers for Progressive Matrices Intelligence Tests | Nov 1, 2020 | Multiple-choice | —Unverified | 0 | 0 |
| Generating Diagnostic Multiple Choice Comprehension Cloze Questions | Jun 1, 2012 | Diagnostic, Multiple-choice | —Unverified | 0 | 0 |
| Who's the Best Detective? LLMs vs. MLs in Detecting Incoherent Fourth Grade Math Answers | Apr 21, 2023 | Math, Multiple-choice | —Unverified | 0 | 0 |
| Generating multiple-choice questions for medical question answering with distractors and cue-masking | Mar 13, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction | Jan 21, 2025 | Distractor Generation, Misconceptions | —Unverified | 0 | 0 |
| Generating Questions and Multiple-Choice Answers using Semantic Analysis of Texts | Dec 1, 2016 | coreference-resolution, Coreference Resolution | —Unverified | 0 | 0 |
| GenNet : Reading Comprehension with Multiple Choice Questions using Generation and Selection model | Mar 3, 2020 | Answer Generation, Machine Reading Comprehension | —Unverified | 0 | 0 |
| Genome-Bench: A Scientific Reasoning Benchmark from Real-World Expert Discussions | May 26, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks | Oct 22, 2024 | Code Generation, Code Summarization | —Unverified | 0 | 0 |