Multiple-choice

Papers


Showing 151–175 of 1107 papers

Title	Date	Tasks	Status	Hype
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models	Feb 24, 2025	GSM8K, Math	Code Available	2
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models	Feb 24, 2025	Logical Reasoning, Multiple-choice	Code Available	1
The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own	Feb 23, 2025	Multiple-choice	Unverified	0
Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare	Feb 22, 2025	Decision Making, Multiple-choice	Code Available	0
Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores	Feb 22, 2025	Distractor Generation, Information Retrieval	Code Available	0
LegalBench.PT: A Benchmark for Portuguese Law	Feb 22, 2025	Multiple-choice	Unverified	0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models	Feb 21, 2025	Benchmarking, Diagnostic	Unverified	0
Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns	Feb 21, 2025	Distractor Generation, Multiple-choice	Unverified	0
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension	Feb 20, 2025	Multiple-choice, Reading Comprehension	Unverified	0
Fundamental Limitations in Defending LLM Finetuning APIs	Feb 20, 2025	Multiple-choice	Unverified	0
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels	Feb 20, 2025	Multiple-choice, Text Generation	Unverified	0
Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora	Feb 19, 2025	Articles, Multiple-choice	Unverified	0
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh	Feb 19, 2025	Instruction Following, Multiple-choice	Unverified	0
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above	Feb 19, 2025	All, Multiple-choice	Unverified	0
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare	Feb 19, 2025	Benchmarking, Diversity	Unverified	0
Towards Geo-Culturally Grounded LLM Generations	Feb 19, 2025	Multiple-choice, Retrieval-augmented Generation	Unverified	0
OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities	Feb 18, 2025	Large Language Model, Multiple-choice	Unverified	0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks	Feb 18, 2025	Math, Memorization	Unverified	0
Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs	Feb 18, 2025	Generative Question Answering, Multiple-choice	Unverified	0
Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering	Feb 17, 2025	Multiple-choice, Question Answering	Unverified	0
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning	Feb 16, 2025	Analogical questions, In-Context Learning	Unverified	0
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models	Feb 16, 2025	Multiple-choice	Code Available	1
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models	Feb 14, 2025	Image Captioning, Large Language Model	Unverified	0
Objective quantification of mood states using large language models	Feb 13, 2025	Multiple-choice	Unverified	0
Truth Knows No Language: Evaluating Truthfulness Beyond English	Feb 13, 2025	Informativeness, Machine Translation	Code Available	0

