SOTAVerified

Multiple-choice

Papers

Showing 376-400 of 1107 papers

Title | Status | Hype
Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning | - | 0
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Code | 0
ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions | - | 0
SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models | - | 0
Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | - | 0
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Code | 0
DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | - | 0
The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | - | 0
Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Code | 0
Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Code | 0
LegalBench.PT: A Benchmark for Portuguese Law | - | 0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | - | 0
Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | - | 0
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | - | 0
Fundamental Limitations in Defending LLM Finetuning APIs | - | 0
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | - | 0
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | - | 0
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | - | 0
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | - | 0
Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | - | 0
Towards Geo-Culturally Grounded LLM Generations | - | 0
OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | - | 0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | - | 0
Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | - | 0
Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | - | 0
Page 16 of 45

No leaderboard results yet.