SOTAVerified

Multiple-choice

Papers

Showing 451–500 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | | 0 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | | 0 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Code | 0 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | | 0 |
| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0 |
| HindiLLM: Large Language Model for Hindi | | 0 |
| Using Large Language Models for Automated Grading of Student Writing about Science | | 0 |
| In Case You Missed It: ARC 'Challenge' Is Not That Challenging | | 0 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | | 0 |
| Auto-bidding in real-time auctions via Oracle Imitation Learning (OIL) | | 0 |
| Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models | | 0 |
| Superhuman performance of a large language model on the reasoning tasks of a physician | | 0 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Code | 0 |
| A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options | | 0 |
| Do LLMs Act as Repositories of Causal Knowledge? | | 0 |
| A multimodal dataset for understanding the impact of mobile phones on remote online virtual education | Code | 0 |
| HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | | 0 |
| Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT | Code | 0 |
| LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering | | 0 |
| MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models | Code | 0 |
| ACQ: A Unified Framework for Automated Programmatic Creativity in Online Advertising | | 0 |
| Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings | Code | 0 |
| Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor | Code | 0 |
| MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects | | 0 |
| GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering | | 0 |
| Establishing Task Scaling Laws via Compute-Efficient Model Ladders | | 0 |
| The use of large language models to enhance cancer clinical trial educational materials | | 0 |
| Unlocking Video-LLM via Agent-of-Thoughts Distillation | | 0 |
| Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models | Code | 0 |
| KnowledgePrompts: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting | Code | 0 |
| Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages | | 0 |
| Cognitive Biases in Large Language Models: A Survey and Mitigation Experiments | | 0 |
| Perception Test 2024: Challenge Summary and a Novel Hour-Long VideoQA Benchmark | | 0 |
| Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers | | 0 |
| Applying IRT to Distinguish Between Human and Generative AI Responses to Multiple-Choice Assessments | | 0 |
| Multiple Choice Learning for Efficient Speech Separation with Many Speakers | | 0 |
| NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects? | | 0 |
| SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text | | 0 |
| GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis | | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | | 0 |
| VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation | | 0 |
| Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning | | 0 |
| A Benchmark for Long-Form Medical Question Answering | Code | 0 |
| DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine | Code | 0 |
| TRACE: Transformer-based Risk Assessment for Clinical Evaluation | Code | 0 |
Page 10 of 23

No leaderboard results yet.