| Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | Feb 25, 2025 | Inductive Bias, Logical Reasoning | Unverified | 0 |
| DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | Feb 25, 2025 | Management, Multiple-choice | Unverified | 0 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | Feb 23, 2025 | Multiple-choice | Unverified | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Feb 22, 2025 | Distractor Generation, Information Retrieval | Code Available | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | Feb 22, 2025 | Multiple-choice | Unverified | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Feb 22, 2025 | Decision Making, Multiple-choice | Code Available | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | Feb 21, 2025 | Distractor Generation, Multiple-choice | Unverified | 0 |
| Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | Feb 20, 2025 | Multiple-choice, Reading Comprehension | Unverified | 0 |
| MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | Feb 20, 2025 | Multiple-choice, Text Generation | Unverified | 0 |