Multiple-choice

Papers


Showing 151–200 of 1107 papers

Title	Date	Tasks	Status	Hype
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models	Feb 24, 2025	GSM8K, Math	Code Available	2
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models	Feb 24, 2025	Logical Reasoning, Multiple-choice	Code Available	1
The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own	Feb 23, 2025	Multiple-choice	Unverified	0
LegalBench.PT: A Benchmark for Portuguese Law	Feb 22, 2025	Multiple-choice	Unverified	0
Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare	Feb 22, 2025	Decision Making, Multiple-choice	Code Available	0
Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores	Feb 22, 2025	Distractor Generation, Information Retrieval	Code Available	0
MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models	Feb 21, 2025	Benchmarking, Diagnostic	Unverified	0
Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns	Feb 21, 2025	Distractor Generation, Multiple-choice	Unverified	0
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension	Feb 20, 2025	Multiple-choice, Reading Comprehension	Unverified	0
Fundamental Limitations in Defending LLM Finetuning APIs	Feb 20, 2025	Multiple-choice	Unverified	0
MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels	Feb 20, 2025	Multiple-choice, Text Generation	Unverified	0
Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora	Feb 19, 2025	Articles, Multiple-choice	Unverified	0
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh	Feb 19, 2025	Instruction Following, Multiple-choice	Unverified	0
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above	Feb 19, 2025	All, Multiple-choice	Unverified	0
Towards Geo-Culturally Grounded LLM Generations	Feb 19, 2025	Multiple-choice, Retrieval-augmented Generation	Unverified	0
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare	Feb 19, 2025	Benchmarking, Diversity	Unverified	0
OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities	Feb 18, 2025	Large Language Model, Multiple-choice	Unverified	0
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks	Feb 18, 2025	Math, Memorization	Unverified	0
Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs	Feb 18, 2025	Generative Question Answering, Multiple-choice	Unverified	0
Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering	Feb 17, 2025	Multiple-choice, Question Answering	Unverified	0
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning	Feb 16, 2025	Analogical questions, In-Context Learning	Unverified	0
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models	Feb 16, 2025	Multiple-choice	Code Available	1
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models	Feb 14, 2025	Image Captioning, Large Language Model	Unverified	0
Objective quantification of mood states using large language models	Feb 13, 2025	Multiple-choice	Unverified	0
Truth Knows No Language: Evaluating Truthfulness Beyond English	Feb 13, 2025	Informativeness, Machine Translation	Code Available	0
SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models	Feb 12, 2025	Fairness, Multiple-choice	Unverified	0
A Semantic Parsing Algorithm to Solve Linear Ordering Problems	Feb 12, 2025	Multiple-choice, Semantic Parsing	Unverified	0
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs	Feb 12, 2025	Multiple-choice, Survey	Unverified	0
PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian	Feb 11, 2025	Multiple-choice	Unverified	0
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark	Feb 10, 2025	MMLU, Morphological Analysis	Unverified	0
HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models	Feb 9, 2025	Answer Generation, Language Modeling	Code Available	0
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning	Feb 8, 2025	Legal Reasoning, Multiple-choice	Code Available	0
ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning	Feb 7, 2025	Multiple-choice, Question Answering	Code Available	0
The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs	Feb 6, 2025	Multiple-choice, Sensitivity	Unverified	0
LLMs to Support a Domain Specific Knowledge Assistant	Feb 6, 2025	Chatbot, Multiple-choice	Unverified	0
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes	Feb 4, 2025	Autonomous Driving, Multiple-choice	Code Available	1
Evalita-LLM: Benchmarking Large Language Models on Italian	Feb 4, 2025	Benchmarking, Multiple-choice	Unverified	0
The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts	Feb 3, 2025	Multiple-choice, Reading Comprehension	Unverified	0
CoddLLM: Empowering Large Language Models for Data Analytics	Feb 1, 2025	Multiple-choice, Synthetic Data Generation	Unverified	0
InnerThoughts: Disentangling Representations and Predictions in Large Language Models	Jan 29, 2025	Multiple-choice, Position	Unverified	0
Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction	Jan 28, 2025	Logical Reasoning, Multiple-choice	Unverified	0
Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection	Jan 28, 2025	Multiple-choice	Unverified	0
Attribution analysis of legal language as used by LLM	Jan 28, 2025	Binary Classification, Multiple-choice	Unverified	0
Options-Aware Dense Retrieval for Multiple-Choice query Answering	Jan 27, 2025	Multiple-choice, Question Answering	Unverified	0
HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI	Jan 26, 2025	MMLU, Multiple-choice	Unverified	0
LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering	Jan 25, 2025	Information Retrieval, Multiple-choice	Unverified	0
LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion	Jan 25, 2025	Multiple-choice, Reading Comprehension	Unverified	0
Option-ID Based Elimination For Multiple Choice Questions	Jan 25, 2025	Multiple-choice	Code Available	0
Humanity's Last Exam	Jan 24, 2025	Humanity's Last Exam, Language Modeling	Unverified	0
Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources	Jan 23, 2025	Multiple-choice	Unverified	0

