SOTAVerified

Multiple-choice

Papers

Showing 351–400 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | | 0 |
| Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | Code | 0 |
| AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models | | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | | 0 |
| VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | | 0 |
| How much do LLMs learn from negative examples? | Code | 0 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Code | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | | 0 |
| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | | 0 |
| SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM | Code | 0 |
| Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words | | 0 |
| VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models | Code | 0 |
| Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | | 0 |
| UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | | 0 |
| Towards Conversational AI for Disease Management | | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Code | 0 |
| This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs | Code | 0 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | | 0 |
| Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction | Code | 0 |
| The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars | | 0 |
| Structured Outputs Enable General-Purpose LLMs to be Medical Experts | | 0 |
| When an LLM is apprehensive about its answers -- and when its uncertainty is justified | Code | 0 |
| None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | | 0 |
| MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | | 0 |
| Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning | | 0 |
| EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Code | 0 |
| ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions | | 0 |
| SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models | | 0 |
| Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | | 0 |
| WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Code | 0 |
| DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | | 0 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Code | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Code | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | | 0 |
| Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | | 0 |
| MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | | 0 |
| Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | | 0 |
| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | | 0 |
| Towards Geo-Culturally Grounded LLM Generations | | 0 |
| OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | | 0 |
| Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | | 0 |
| Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | | 0 |
Page 8 of 23

No leaderboard results yet.