SOTAVerified

Multiple-choice

Papers

Showing 151-175 of 1107 papers

Title | Status | Hype
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Code | 2
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | - | 0
Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Code | 0
Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Code | 0
LegalBench.PT: A Benchmark for Portuguese Law | - | 0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | - | 0
Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | - | 0
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | - | 0
Fundamental Limitations in Defending LLM Finetuning APIs | - | 0
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | - | 0
Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | - | 0
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | - | 0
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | - | 0
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | - | 0
Towards Geo-Culturally Grounded LLM Generations | - | 0
OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | - | 0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | - | 0
Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | - | 0
Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | - | 0
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning | - | 0
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Code | 1
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | - | 0
Objective quantification of mood states using large language models | - | 0
Truth Knows No Language: Evaluating Truthfulness Beyond English | Code | 0
Page 7 of 45

No leaderboard results yet.