SOTAVerified

Multiple-choice

Papers

Showing 251-300 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | | 0 |
| A multimodal dataset for understanding the impact of mobile phones on remote online virtual education | Code | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | | 0 |
| Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT | Code | 0 |
| Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Code | 0 |
| Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings | Code | 0 |
| ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising | | 0 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Code | 0 |
| MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects | | 0 |
| GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | | 0 |
| AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Code | 1 |
| SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages | Code | 1 |
| The use of large language models to enhance cancer clinical trial educational materials | | 0 |
| Unlocking Video-LLM via Agent-of-Thoughts Distillation | | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Code | 0 |
| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | | 0 |
| KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting | Code | 0 |
| VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information | Code | 1 |
| Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments | | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | | 0 |
| Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments | | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | | 0 |
| Multiple Choice Learning for Efficient Speech Separation with Many Speakers | | 0 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Code | 1 |
| NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | | 0 |
| All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | Code | 1 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | | 0 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | | 0 |
| VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? | Code | 1 |
| A Benchmark for Long-Form Medical Question Answering | Code | 0 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Code | 0 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Code | 0 |
| IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs | Code | 0 |
| SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing Agents | | 0 |
| StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification | Code | 2 |
| Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability | | 0 |
| Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators | | 0 |
| Quantitative Assessment of Intersectional Empathetic Bias and Understanding | Code | 0 |
| ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding | | 0 |
| HourVideo: 1-Hour Video-Language Understanding | Code | 2 |
| MEG: Medical Knowledge-Augmented Large Language Models for Question Answering | Code | 1 |
| MILU: A Multi-task Indic Language Understanding Benchmark | Code | 1 |
| FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | | 0 |
| PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | Code | 2 |

No leaderboard results yet.