| None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | Mar 3, 2025 | Business EthicsEthics | —Unverified | 0 |
| MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | Feb 28, 2025 | MathMathematical Reasoning | —Unverified | 0 |
| BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology | Feb 28, 2025 | Multiple-choicescientific discovery | CodeCode Available | 2 |
| EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Feb 27, 2025 | Multiple-choice | CodeCode Available | 0 |
| Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning | Feb 27, 2025 | MathMedical Question Answering | —Unverified | 0 |
| ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions | Feb 26, 2025 | Language ModelingLanguage Modelling | —Unverified | 0 |
| WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Feb 25, 2025 | MMLUMultiple-choice | CodeCode Available | 0 |
| SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models | Feb 25, 2025 | Continual LearningGSM8K | —Unverified | 0 |
| DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | Feb 25, 2025 | ManagementMultiple-choice | —Unverified | 0 |
| Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | Feb 25, 2025 | Inductive BiasLogical Reasoning | —Unverified | 0 |