| Enhancing LLM Evaluations: The Garbling Trick | Nov 3, 2024 | Multiple-choice | —Unverified | 0 |
| Answering Chinese Elementary School Social Study Multiple Choice Questions | Jun 26, 2021 | Multiple-choice, Negation | —Unverified | 0 |
| First Token Probability Guided RAG for Telecom Question Answering | Jan 11, 2025 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 |
| Enhancing Event Causality Identification with Rationale and Structure-Aware Causal Question Answering | Mar 17, 2024 | Event Causality Identification, Multiple-choice | —Unverified | 0 |
| Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation | May 15, 2025 | Informativeness, Multiple-choice | —Unverified | 0 |
| Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration | Jun 19, 2024 | Benchmarking, Distractor Generation | —Unverified | 0 |
| Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination | Sep 19, 2024 | General Knowledge, MMLU | —Unverified | 0 |
| ForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data | May 2, 2020 | Knowledge Graphs, Language Modelling | —Unverified | 0 |
| FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models | Apr 29, 2024 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 |
| AGReE: A system for generating Automated Grammar Reading Exercises | Oct 28, 2022 | Articles, Multiple-choice | —Unverified | 0 |
| Framing QA as Building and Ranking Intersentence Answer Justifications | Jun 1, 2017 | Multiple-choice, Question Answering | —Unverified | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | Apr 4, 2025 | Multiple-choice | —Unverified | 0 |
| From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project | Sep 4, 2019 | Multiple-choice, Question Answering | —Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 |
| SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | Nov 12, 2024 | General Knowledge, Hallucination | —Unverified | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | —Unverified | 0 |
| Humanity's Last Exam | Jan 24, 2025 | Humanity's Last Exam, Language Modeling | —Unverified | 0 |
| End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering | Oct 10, 2016 | Language Modeling, Language Modelling | —Unverified | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | Feb 20, 2025 | Multiple-choice | —Unverified | 0 |
| Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | Apr 5, 2024 | Multiple-choice, Navigate | —Unverified | 0 |
| FusionMind -- Improving question and answering with external context fusion | Dec 31, 2023 | Knowledge Graphs, Multiple-choice | —Unverified | 0 |
| Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework | Jan 16, 2025 | Multiple-choice, Question Generation | —Unverified | 0 |
| Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions | Oct 24, 2020 | General Classification, Multiple-choice | —Unverified | 0 |
| LLMs May Perform MCQA by Selecting the Least Incorrect Option | Feb 2, 2024 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 |
| ELiRF-UPV at SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge | Jun 1, 2018 | Multiple-choice, Question Answering | —Unverified | 0 |