| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | Dec 30, 2024 | Benchmarking, Code Generation | —Unverified | 0 | 0 |
| SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models | Feb 25, 2025 | Continual Learning, GSM8K | —Unverified | 0 | 0 |
| Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | Jun 29, 2025 | Model Selection, Multiple-choice | —Unverified | 0 | 0 |
| Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models | Oct 18, 2024 | Fairness, Multiple-choice | —Unverified | 0 | 0 |
| From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams | Jun 11, 2022 | BIG-bench Machine Learning, Few-Shot Learning | —Unverified | 0 | 0 |
| A Data-Driven Study of Commonsense Knowledge using the ConceptNet Knowledge Base | Nov 28, 2020 | Clustering, Graph Representation Learning | —Unverified | 0 | 0 |
| Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models | Dec 15, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Selective Particle Attention: Visual Feature-Based Attention in Deep Reinforcement Learning | Aug 26, 2020 | Deep Reinforcement Learning, Multiple-choice | —Unverified | 0 | 0 |
| Self-Evaluation Improves Selective Generation in Large Language Models | Dec 14, 2023 | Multiple-choice, TruthfulQA | —Unverified | 0 | 0 |
| Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory | May 2, 2025 | Multiple-choice | —Unverified | 0 | 0 |
| Self-supervised pre-training and contrastive representation learning for multiple-choice video QA | Sep 17, 2020 | Auxiliary Learning, Contrastive Learning | —Unverified | 0 | 0 |
| Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data | Feb 1, 2021 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| Semi-automatic Generation of Multiple-Choice Tests from Mentions of Semantic Relations | Jul 1, 2015 | Multiple-choice, Reading Comprehension | —Unverified | 0 | 0 |
| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | Jan 1, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Set-LLM: A Permutation-Invariant LLM | May 21, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | Dec 31, 2024 | Language Model Evaluation, Language Modeling | —Unverified | 0 | 0 |
| Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions | Apr 11, 2022 | Multiple-choice, Reading Comprehension | —Unverified | 0 | 0 |
| Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | Mar 10, 2025 | Form, Multiple-choice | —Unverified | 0 | 0 |
| Social IQa: Commonsense Reasoning about Social Interactions | Nov 1, 2019 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Solving Visual Madlibs with Multiple Cues | Aug 11, 2016 | Activity Prediction, Attribute | —Unverified | 0 | 0 |
| SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | May 27, 2025 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | Nov 28, 2024 | Image Captioning, image-classification | —Unverified | 0 | 0 |
| Spending Money Wisely: Online Electronic Coupon Allocation based on Real-Time User Intent Detection | Aug 23, 2020 | Intent Detection, Multiple-choice | —Unverified | 0 | 0 |
| VUDG: A Dataset for Video Understanding Domain Generalization | May 30, 2025 | Domain Generalization, Multiple-choice | —Unverified | 0 | 0 |
| SPRITE: A Response Model For Multiple Choice Testing | Jan 12, 2015 | model, Multiple-choice | —Unverified | 0 | 0 |
| Weighted Global Normalization for Multiple Choice Reading Comprehension over Long Documents | Dec 5, 2018 | Answer Selection, Multiple-choice | —Unverified | 0 | 0 |
| Recent Advances in Multi-Choice Machine Reading Comprehension: A Survey on Methods and Datasets | Aug 4, 2024 | Few-Shot Learning, Machine Reading Comprehension | —Unverified | 0 | 0 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | Mar 7, 2025 | Conformal Prediction, Medical Question Answering | —Unverified | 0 | 0 |
| Statistically Profiling Biases in Natural Language Reasoning Datasets and Models | Feb 9, 2021 | Multiple-choice, Natural Language Understanding | —Unverified | 0 | 0 |
| Adaptive Crowdsourcing Algorithms for the Bandit Survey Problem | Feb 13, 2013 | Information Retrieval, Multiple-choice | —Unverified | 0 | 0 |
| Stick to your Role! Stability of Personal Values Expressed in Large Language Models | Feb 19, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles | Jun 24, 2016 | Multiple-choice | —Unverified | 0 | 0 |
| Adapting Vision-Language Models for Evaluating World Models | Jun 22, 2025 | Action Recognition, Multimodal Reasoning | —Unverified | 0 | 0 |
| Strategyproof Mean Estimation from Multiple-Choice Questions | Jan 1, 2020 | Multiple-choice | —Unverified | 0 | 0 |
| Structured Outputs Enable General-Purpose LLMs to be Medical Experts | Mar 5, 2025 | Clinical Knowledge, Medical Question Answering | —Unverified | 0 | 0 |
| What does BERT Learn from Multiple-Choice Reading Comprehension Datasets? | Oct 28, 2019 | Multiple-choice, Reading Comprehension | —Unverified | 0 | 0 |
| Superhuman performance of a large language model on the reasoning tasks of a physician | Dec 14, 2024 | Diagnostic, Language Modeling | —Unverified | 0 | 0 |
| What do we expect from Multiple-choice QA Systems? | Nov 20, 2020 | Multiple-choice, Multiple Choice Question Answering (MCQA) | —Unverified | 0 | 0 |
| What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets | Jul 7, 2020 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S | Oct 21, 2024 | Multiple-choice | —Unverified | 0 | 0 |
| SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference | Aug 16, 2018 | Common Sense Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages | Jun 20, 2024 | Language Modelling, Large Language Model | —Unverified | 0 | 0 |
| TabMCQ: A Dataset of General Knowledge Tables and Multiple-choice Questions | Feb 12, 2016 | General Knowledge, Multiple-choice | —Unverified | 0 | 0 |
| TA-MAMC at SemEval-2021 Task 4: Task-adaptive Pretraining and Multi-head Attention for Abstract Meaning Reading Comprehension | Aug 1, 2021 | Contrastive Learning, Multiple-choice | —Unverified | 0 | 0 |
| Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling | Sep 30, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine | May 29, 2025 | Diagnostic, Multiple-choice | —Unverified | 0 | 0 |
| Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | May 9, 2025 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Empowering Sentence Encoders with Prompting and Label Retrieval for Zero-shot Text Classification | Dec 20, 2022 | Classification, Descriptive | —Unverified | 0 | 0 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | Nov 18, 2024 | Logical Reasoning, Multiple-choice | —Unverified | 0 | 0 |
| Answering Chinese Elementary School Social Studies Multiple Choice Questions | Dec 1, 2021 | Multiple-choice | —Unverified | 0 | 0 |