SOTAVerified

Multiple-choice

Papers

Showing 151-200 of 1107 papers

| Title | Status | Hype |
|---|---|---|
| Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Code | 2 |
| AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1 |
| The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own | | 0 |
| LegalBench.PT: A Benchmark for Portuguese Law | | 0 |
| Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare | Code | 0 |
| Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores | Code | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0 |
| Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns | | 0 |
| Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension | | 0 |
| Fundamental Limitations in Defending LLM Finetuning APIs | | 0 |
| MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels | | 0 |
| Is This Collection Worth My LLM's Time? Automatically Measuring Information Potential in Text Corpora | | 0 |
| Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh | | 0 |
| Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | | 0 |
| Towards Geo-Culturally Grounded LLM Generations | | 0 |
| VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | | 0 |
| OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities | | 0 |
| None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | | 0 |
| Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs | | 0 |
| Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering | | 0 |
| LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning | | 0 |
| Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Code | 1 |
| VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models | | 0 |
| Objective quantification of mood states using large language models | | 0 |
| Truth Knows No Language: Evaluating Truthfulness Beyond English | Code | 0 |
| SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models | | 0 |
| A Semantic Parsing Algorithm to Solve Linear Ordering Problems | | 0 |
| Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs | | 0 |
| PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian | | 0 |
| Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark | | 0 |
| HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models | Code | 0 |
| Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning | Code | 0 |
| ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning | Code | 0 |
| The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs | | 0 |
| LLMs to Support a Domain Specific Knowledge Assistant | | 0 |
| TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | | 0 |
| The Use of Artificial Intelligence Tools in Assessing Content Validity: A Comparative Study with Human Experts | | 0 |
| CoddLLM: Empowering Large Language Models for Data Analytics | | 0 |
| InnerThoughts: Disentangling Representations and Predictions in Large Language Models | | 0 |
| Town Hall Debate Prompting: Enhancing Logical Reasoning in LLMs through Multi-Persona Interaction | | 0 |
| Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection | | 0 |
| Attribution analysis of legal language as used by LLM | | 0 |
| Options-Aware Dense Retrieval for Multiple-Choice query Answering | | 0 |
| HardML: A Benchmark For Evaluating Data Science And Machine Learning knowledge and reasoning in AI | | 0 |
| LLM Evaluation Based on Aerospace Manufacturing Expertise: Automated Generation and Multi-Model Question Answering | | 0 |
| LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion | | 0 |
| Option-ID Based Elimination For Multiple Choice Questions | Code | 0 |
| Humanity's Last Exam | | 0 |
| Auto-Evaluation: A Critical Measure in Driving Improvements in Quality and Safety of AI-Generated Lesson Resources | | 0 |
Page 4 of 23

No leaderboard results yet.