| AstroMLab 1: Who Wins Astronomy Jeopardy!? | Jul 15, 2024 | Astronomy, Benchmarking | —Unverified | 0 | 0 |
| Analysing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets | Sep 29, 2021 | Language Modelling, Machine Reading Comprehension | —Unverified | 0 | 0 |
| HRCA+: Advanced Multiple-choice Machine Reading Comprehension Method | Jun 1, 2022 | Machine Reading Comprehension, Multiple-choice | —Unverified | 0 | 0 |
| Context-guided Triple Matching for Multiple Choice Question Answering | Sep 27, 2021 | Benchmarking, Multiple-choice | —Unverified | 0 | 0 |
| How well do LLMs reason over tabular data, really? | May 12, 2025 | Missing Values, Multiple-choice | —Unverified | 0 | 0 |
| How Susceptible are LLMs to Influence in Prompts? | Aug 17, 2024 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |
| How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels | Nov 1, 2014 | Multiple-choice | —Unverified | 0 | 0 |
| A statistical model for aggregating judgments by incorporating peer predictions | Mar 14, 2017 | counterfactual, Multiple-choice | —Unverified | 0 | 0 |
| Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III | Jun 29, 2025 | Model Selection, Multiple-choice | —Unverified | 0 | 0 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | Jun 19, 2025 | Multiple-choice, Question Answering | —Unverified | 0 | 0 |