| CinePile: A Long Video Question Answering Dataset and Benchmark | May 14, 2024 | Form, Human-Object Interaction Detection | —Unverified | 0 | 0 |
| Cleared for Takeoff? Compositional & Conditional Reasoning may be the Achilles Heel to (Flight-Booking) Language Agents | Apr 5, 2024 | Multiple-choice, Navigate | —Unverified | 0 | 0 |
| ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases | May 30, 2025 | Medical Question Answering, Multiple-choice | —Unverified | 0 | 0 |
| An Experimental Study of Deep Neural Network Models for Vietnamese Multiple-Choice Reading Comprehension | Aug 20, 2020 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | Jan 2, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Clozer: Adaptable Data Augmentation for Cloze-style Reading Comprehension | Mar 30, 2022 | Data Augmentation, Machine Reading Comprehension | —Unverified | 0 | 0 |
| Clozer: Adaptable Data Augmentation for Cloze-style Reading Comprehension | May 1, 2022 | Data Augmentation, Machine Reading Comprehension | —Unverified | 0 | 0 |
| Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge | Feb 5, 2021 | AI2 Reasoning Challenge, ARC | —Unverified | 0 | 0 |
| A New Era: Intelligent Tutoring Systems Will Transform Online Learning for Millions | Mar 3, 2022 | Active Learning, Multiple-choice | —Unverified | 0 | 0 |
| CoddLLM: Empowering Large Language Models for Data Analytics | Feb 1, 2025 | Multiple-choice, Synthetic Data Generation | —Unverified | 0 | 0 |
| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | Mar 20, 2025 | Code Generation, Multiple-choice | —Unverified | 0 | 0 |
| COGNET-MD, an evaluation framework and dataset for Large Language Model benchmarks in the medical domain | May 17, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments | Nov 30, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Collaboration among Multiple Large Language Models for Medical Question Answering | May 22, 2025 | Medical Question Answering, Multiple-choice | —Unverified | 0 | 0 |
| Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses | Jun 15, 2023 | Multiple-choice | —Unverified | 0 | 0 |
| Combinatorial framework for planning in geological exploration | Jan 22, 2018 | Attribute, Multiple-choice | —Unverified | 0 | 0 |
| Combining Multiple Cues for Visual Madlibs Question Answering | Nov 1, 2016 | Attribute, General Classification | —Unverified | 0 | 0 |
| Comparative Study of Learning Outcomes for Online Learning Platforms | Apr 15, 2021 | Active Learning, Multiple-choice | —Unverified | 0 | 0 |
| Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding | Jun 17, 2025 | Multiple-choice, Natural Language Inference | —Unverified | 0 | 0 |
| Confidence-Aware Learning Assistant | Feb 15, 2021 | Multiple-choice | —Unverified | 0 | 0 |
| You Can Do Better! If You Elaborate the Reason When Making Prediction | Mar 27, 2021 | Multiple-choice, Natural Language Inference | —Unverified | 0 | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Sep 27, 2021 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Jan 16, 2022 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| Context Modeling with Evidence Filter for Multiple Choice Question Answering | Oct 6, 2020 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| Contextual Response Interpretation for Automated Structured Interviews: A Case Study in Market Research | Apr 30, 2023 | Marketing, Multiple-choice | —Unverified | 0 | 0 |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | —Unverified | 0 | 0 |
| Conversational Assistants and Gender Stereotypes: Public Perceptions and Desiderata for Voice Personas | Dec 1, 2020 | Multiple-choice | —Unverified | 0 | 0 |
| ACPBench: Reasoning about Action, Change, and Planning | Oct 8, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Convolutional Spatial Attention Model for Reading Comprehension with Multiple-Choice Questions | Nov 21, 2018 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning | Aug 31, 2019 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| CP-Router: An Uncertainty-Aware Router Between LLM and LRM | May 26, 2025 | Conformal Prediction, Logical Reasoning | —Unverified | 0 | 0 |
| Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia | Sep 13, 2024 | Math, Multiple-choice | —Unverified | 0 | 0 |
| CroaTPAS: A Survey-based Evaluation | Jun 1, 2022 | Multiple-choice, Survey | —Unverified | 0 | 0 |
| Template Filling for Controllable Commonsense Reasoning | Oct 31, 2021 | Multiple-choice | —Unverified | 0 | 0 |
| Crowd Labeling: a survey | Jan 13, 2013 | Multiple-choice, Survey | —Unverified | 0 | 0 |
| Crowdsourcing Multiple Choice Science Questions | Jul 19, 2017 | Diversity, Multiple-choice | —Unverified | 0 | 0 |
| CS-NLP team at SemEval-2020 Task 4: Evaluation of State-of-the-art NLP Deep Learning Architectures on Commonsense Reasoning Task | May 17, 2020 | Multiple-choice, Natural Language Inference | —Unverified | 0 | 0 |
| CSReader at SemEval-2018 Task 11: Multiple Choice Question Answering as Textual Entailment | Jun 1, 2018 | Common Sense Reasoning, Language Modelling | —Unverified | 0 | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | Feb 10, 2025 | MMLU, Morphological Analysis | —Unverified | 0 | 0 |
| A Neural Question Answering Model Based on Semi-Structured Tables | Aug 1, 2018 | Knowledge Graphs, Multiple-choice | —Unverified | 0 | 0 |
| Zero-shot Event Causality Identification with Question Answering | Sep 1, 2022 | Articles, Event Causality Identification | —Unverified | 0 | 0 |
| DARE: Diverse Visual Question Answering with Robustness Evaluation | Sep 26, 2024 | image-classification, Image Classification | —Unverified | 0 | 0 |
| ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | Mar 31, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond | Oct 23, 2023 | counterfactual, Multiple-choice | —Unverified | 0 | 0 |
| Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context | Jun 10, 2024 | Decision Making, Multiple-choice | —Unverified | 0 | 0 |
| Deep learning for sentence clustering in essay grading support | Apr 23, 2021 | Clustering, Deep Learning | —Unverified | 0 | 0 |
| DeepQR: Neural-based Quality Ratings for Learnersourced Multiple-Choice Questions | Nov 19, 2021 | Contrastive Learning, Multiple-choice | —Unverified | 0 | 0 |
| DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | Feb 25, 2025 | Management, Multiple-choice | —Unverified | 0 | 0 |
| Designing Templates for Eliciting Commonsense Knowledge from Pretrained Sequence-to-Sequence Models | Dec 1, 2020 | Multiple-choice, Natural Language Understanding | —Unverified | 0 | 0 |
| DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding | Oct 24, 2023 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |