SOTAVerified

Multiple-choice

Papers

Showing 276-300 of 1107 papers

Title | Status | Hype
Multiple Choice Learning for Efficient Speech Separation with Many Speakers | | 0
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | | 0
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | | 0
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Code | 1
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | | 0
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | | 0
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | | 0
Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | | 0
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Code | 1
A Benchmark for Long-Form Medical Question Answering | Code | 0
DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Code | 0
TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Code | 0
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Code | 0
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | | 0
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification | Code | 2
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability | | 0
Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators | | 0
Quantitative Assessment of Intersectional Empathetic Bias and Understanding | Code | 0
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | | 0
HourVideo: 1-Hour Video-Language Understanding | Code | 2
MEG: Medical Knowledge-Augmented Large Language Models for Question Answering | Code | 1
MILU: A Multi-task Indic Language Understanding Benchmark | Code | 1
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | | 0
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Code | 2

No leaderboard results yet.