SOTAVerified

Multiple-choice

Papers

Showing 326–350 of 1107 papers

Title | Status | Hype
LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models | Code | 0
Taming Overconfidence in LLMs: Reward Calibration in RLHF | Code | 1
The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models | - | 0
NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models | Code | 0
SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models | Code | 1
Sample then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models | - | 0
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models | - | 0
TVBench: Redesigning Video-Language Evaluation | - | 0
Answering Questions in Stages: Prompt Chaining for Contract QA | - | 0
Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning | Code | 0
ACPBench: Reasoning about Action, Change, and Planning | - | 0
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition | - | 0
Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning | Code | 0
Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA | - | 0
Video Instruction Tuning With Synthetic Data | - | 0
Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales | Code | 0
MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Code | 1
DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models | Code | 0
Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model | - | 0
A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning | Code | 1
Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling | - | 0
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs | - | 0
Mitigating Selection Bias with Node Pruning and Auxiliary Options | - | 0
DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking | Code | 0
DARE: Diverse Visual Question Answering with Robustness Evaluation | - | 0
Page 14 of 45

No leaderboard results yet.