| GPT Takes the Bar Exam | Dec 29, 2022 | Hyperparameter Optimization, Multiple-choice | Code Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical Reasoning, Multiple-choice | Code Available | 1 |
| Evaluating the Knowledge Dependency of Questions | Nov 21, 2022 | Multiple-choice | Code Available | 1 |
| ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning | Apr 15, 2021 | Graph Generation, Multiple-choice | Code Available | 1 |
| IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models | Mar 23, 2024 | Common Sense Reasoning, In-Context Learning | Code Available | 1 |
| LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs | Aug 16, 2024 | Instruction Following, Multiple-choice | Code Available | 1 |
| Multiple-Choice Questions are Efficient and Robust LLM Evaluators | May 20, 2024 | GSM8K, HumanEval | Code Available | 1 |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| Contextual Response Interpretation for Automated Structured Interviews: A Case Study in Market Research | Apr 30, 2023 | Marketing, Multiple-choice | Unverified | 0 |
| Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets | Sep 29, 2021 | Language Modelling, Machine Reading Comprehension | Unverified | 0 |
| Context Modeling with Evidence Filter for Multiple Choice Question Answering | Oct 6, 2020 | Machine Reading Comprehension, Multiple-choice | Unverified | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Jan 16, 2022 | Benchmarking, Multiple-choice | Unverified | 0 |
| AstroMLab 1: Who Wins Astronomy Jeopardy!? | Jul 15, 2024 | Astronomy, Benchmarking | Unverified | 0 |
| Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth | Jun 8, 2025 | Multiple-choice | Unverified | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Sep 27, 2021 | Benchmarking, Multiple-choice | Unverified | 0 |
| A statistical model for aggregating judgments by incorporating peer predictions | Mar 14, 2017 | counterfactual, Multiple-choice | Unverified | 0 |
| Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | Jun 29, 2025 | Model Selection, Multiple-choice | Unverified | 0 |
| Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models | Oct 18, 2024 | Fairness, Multiple-choice | Unverified | 0 |
| Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis | Jan 28, 2024 | Knowledge Graphs, Medical Diagnosis | Unverified | 0 |
| Confidence-Aware Learning Assistant | Feb 15, 2021 | Multiple-choice | Unverified | 0 |
| Comparative Study of Learning Outcomes for Online Learning Platforms | Apr 15, 2021 | Active Learning, Multiple-choice | Unverified | 0 |
| Assessing Large Language Models in Mechanical Engineering Education: A Study on Mechanics-Focused Conceptual Understanding | Jan 13, 2024 | Multiple-choice, Prompt Engineering | Unverified | 0 |
| An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System | Sep 17, 2021 | Multiple-choice, software testing | Unverified | 0 |
| Combining Multiple Cues for Visual Madlibs Question Answering | Nov 1, 2016 | Attribute, General Classification | Unverified | 0 |
| Combinatorial framework for planning in geological exploration | Jan 22, 2018 | Attribute, Multiple-choice | Unverified | 0 |
| Assessing Distractors in Multiple-Choice Tests | Nov 8, 2023 | Diversity, Multiple-choice | Unverified | 0 |
| Assessing AI-Generated Questions' Alignment with Cognitive Frameworks in Educational Assessment | Apr 19, 2025 | Classification, Multiple-choice | Unverified | 0 |
| An AI-based Solution for Enhancing Delivery of Digital Learning for Future Teachers | Nov 9, 2021 | Multiple-choice, Question Generation | Unverified | 0 |
| Evaluating LLMs on Document-Based QA: Exact Answer Selection and Numerical Extraction using Cogtale dataset | Nov 14, 2023 | Answer Selection, Information Retrieval | Unverified | 0 |
| Collaboration among Multiple Large Language Models for Medical Question Answering | May 22, 2025 | Medical Question Answering, Multiple-choice | Unverified | 0 |
| Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments | Nov 30, 2024 | Multiple-choice | Unverified | 0 |
| An Add-On for Empowering Google Forms to be an Automatic Question Generator in Online Assessments | Sep 21, 2021 | Multiple-choice | Unverified | 0 |
| COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | May 17, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | Mar 20, 2025 | Code Generation, Multiple-choice | Unverified | 0 |
| A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs | Jun 11, 2025 | Multiple-choice | Unverified | 0 |
| A Data-Driven Study of Commonsense Knowledge using the ConceptNet Knowledge Base | Nov 28, 2020 | Clustering, Graph Representation Learning | Unverified | 0 |
| CoddLLM: Empowering Large Language Models for Data Analytics | Feb 1, 2025 | Multiple-choice, Synthetic Data Generation | Unverified | 0 |
| A Semantic Parsing Algorithm to Solve Linear Ordering Problems | Feb 12, 2025 | Multiple-choice, Semantic Parsing | Unverified | 0 |
| A Semantic Feature-Wise Transformation Relation Network for Automatic Short Answer Grading | Nov 1, 2021 | automatic short answer grading, Data Augmentation | Unverified | 0 |
| From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams | Jun 11, 2022 | BIG-bench Machine Learning, Few-Shot Learning | Unverified | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | Dec 5, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| Aryl: An Elastic Cluster Scheduler for Deep Learning | Feb 16, 2022 | Deep Learning, GPU | Unverified | 0 |
| Clozer: Adaptable Data Augmentation for Cloze-style Reading Comprehension | May 1, 2022 | Data Augmentation, Machine Reading Comprehension | Unverified | 0 |
| Clozer: Adaptable Data Augmentation for Cloze-style Reading Comprehension | Mar 30, 2022 | Data Augmentation, Machine Reading Comprehension | Unverified | 0 |
| Amobee at SemEval-2019 Tasks 5 and 6: Multiple Choice CNN Over Contextual Embedding | Apr 17, 2019 | Multiple-choice | Unverified | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | Jan 1, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | Jan 2, 2025 | Multiple-choice, Question Answering | Unverified | 0 |
| A Method for Building a Commonsense Inference Dataset based on Basic Events | Nov 1, 2020 | Multiple-choice, Transfer Learning | Unverified | 0 |
| ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases | May 30, 2025 | Medical Question Answering, Multiple-choice | Unverified | 0 |
| Enhancing Multiple-Choice Question Answering with Causal Knowledge | Jun 1, 2021 | Multiple-choice, Question Answering | Unverified | 0 |