| Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models | Nov 10, 2023 | GSM8K, Memorization | Code Available | 1 |
| EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Aug 17, 2023 | Diagnostic, EgoSchema | Code Available | 1 |
| Estimating Contamination via Perplexity: Quantifying Memorisation in Language Model Evaluation | Sep 19, 2023 | Language Model Evaluation, Language Modeling | Code Available | 1 |
| Explicit Planning Helps Language Models in Logical Reasoning | Mar 28, 2023 | Logical Reasoning, Multiple-choice | Code Available | 1 |
| InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding | Jun 28, 2024 | Multiple-choice, Video Understanding | Code Available | 1 |
| ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks | Jul 8, 2024 | Anomaly Detection, Code Generation | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment | Mar 3, 2024 | Cloze Test, Multiple-choice | Unverified | 0 |
| Contextual Response Interpretation for Automated Structured Interviews: A Case Study in Market Research | Apr 30, 2023 | Marketing, Multiple-choice | Unverified | 0 |
| Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets | Sep 29, 2021 | Language Modelling, Machine Reading Comprehension | Unverified | 0 |
| Context Modeling with Evidence Filter for Multiple Choice Question Answering | Oct 6, 2020 | Machine Reading Comprehension, Multiple-choice | Unverified | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Jan 16, 2022 | Benchmarking, Multiple-choice | Unverified | 0 |
| AstroMLab 1: Who Wins Astronomy Jeopardy!? | Jul 15, 2024 | Astronomy, Benchmarking | Unverified | 0 |
| E-Commerce Promotions Personalization via Online Multiple-Choice Knapsack with Uplift Modeling | Aug 11, 2021 | Multiple-choice | Unverified | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Sep 27, 2021 | Benchmarking, Multiple-choice | Unverified | 0 |
| A statistical model for aggregating judgments by incorporating peer predictions | Mar 14, 2017 | counterfactual, Multiple-choice | Unverified | 0 |
| Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | Jun 29, 2025 | Model Selection, Multiple-choice | Unverified | 0 |
| Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models | Oct 18, 2024 | Fairness, Multiple-choice | Unverified | 0 |
| Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints | May 28, 2024 | Multiple-choice, Sentence | Unverified | 0 |
| Confidence-Aware Learning Assistant | Feb 15, 2021 | Multiple-choice | Unverified | 0 |
| Comparative Study of Learning Outcomes for Online Learning Platforms | Apr 15, 2021 | Active Learning, Multiple-choice | Unverified | 0 |
| Assessing Large Language Models in Mechanical Engineering Education: A Study on Mechanics-Focused Conceptual Understanding | Jan 13, 2024 | Multiple-choice, Prompt Engineering | Unverified | 0 |
| An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System | Sep 17, 2021 | Multiple-choice, software testing | Unverified | 0 |
| Combining Multiple Cues for Visual Madlibs Question Answering | Nov 1, 2016 | Attribute, General Classification | Unverified | 0 |
| Combinatorial framework for planning in geological exploration | Jan 22, 2018 | Attribute, Multiple-choice | Unverified | 0 |