SOTAVerified

Multiple-choice

Papers

Showing 501–550 of 1107 papers

Title | Status | Hype
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Code | 0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents |  | 0
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability |  | 0
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators |  | 0
Quantitative Assessment of Intersectional Empathetic Bias and Understanding | Code | 0
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding |  | 0
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees |  | 0
Enhancing LLM Evaluations: The Garbling Trick |  | 0
Benchmarking Bias in Large Language Models during Role-Playing |  | 0
R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest |  | 0
GPT-4o System Card |  | 0
Beyond Multiple-Choice Accuracy: Real-World Challenges of Implementing Large Language Models in Healthcare |  | 0
Large Language Models Still Exhibit Bias in Long Text |  | 0
GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks |  | 0
How Can We Diagnose and Treat Bias in Large Language Models for Clinical Decision-Making? | Code | 0
Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the U.S. |  | 0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs |  | 0
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models |  | 0
LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights |  | 0
CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy |  | 0
MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback | Code | 0
Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks | Code | 0
Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers | Code | 0
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs | Code | 0
Personalised Feedback Framework for Online Education Programmes Using Generative AI |  | 0
Not All Options Are Created Equal: Textual Option Weighting for Token-Efficient LLM-Based Knowledge Tracing |  | 0
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models |  | 0
LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Code | 0
The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models |  | 0
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Code | 0
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models |  | 0
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models |  | 0
TVBench: Redesigning Video-Language Evaluation |  | 0
Answering Questions in Stages: Prompt Chaining for Contract QA |  | 0
Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning | Code | 0
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition |  | 0
ACPBench: Reasoning about Action, Change, and Planning |  | 0
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning | Code | 0
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA |  | 0
Video Instruction Tuning With Synthetic Data |  | 0
DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models | Code | 0
Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Code | 0
Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model |  | 0
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling |  | 0
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs |  | 0
Mitigating Selection Bias with Node Pruning and Auxiliary Options |  | 0
DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking | Code | 0
DARE: Diverse Visual Question Answering with Robustness Evaluation |  | 0
LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ |  | 0
RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation |  | 0

No leaderboard results yet.