SOTAVerified

Multiple-choice

Papers

Showing 201–250 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| On the Reasoning Capacity of AI Models and How to Quantify It | | 0 |
| The AI Penalization Effect: People Reduce Compensation for Workers Who Use AI | | 0 |
| Patent Figure Classification using Large Vision-language Models | Code | 0 |
| Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction | | 0 |
| MedS^3: Towards Medical Small Language Models with Self-Evolved Slow Thinking | Code | 2 |
| Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! | | 0 |
| FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1 |
| Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong | | 0 |
| Vision-Language Models Do Not Understand Negation | | 0 |
| Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework | | 0 |
| Towards Multilingual LLM Evaluation for Baltic and Nordic Languages: A Study on Lithuanian History | | 0 |
| ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1 |
| Rethinking AI Cultural Alignment | | 0 |
| Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation | | 0 |
| ZNO-Eval: Benchmarking Reasoning Capabilities of Large Language Models in Ukrainian | Code | 1 |
| First Token Probability Guided RAG for Telecom Question Answering | | 0 |
| Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Code | 0 |
| Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs | Code | 0 |
| Knowledge Retrieval Based on Generative AI | | 0 |
| DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests | | 0 |
| Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States | | 0 |
| Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1 |
| (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges | Code | 0 |
| CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering | | 0 |
| Unifying Specialized Visual Encoders for Video Language Models | Code | 1 |
| Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation | | 0 |
| Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs | | 0 |
| Separation of Powers: On Segregating Knowledge from Observation in LLM-enabled Knowledge-based Visual Question Answering | | 0 |
| FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding | Code | 0 |
| IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models | | 0 |
| EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta | | 0 |
| A Review of Faithfulness Metrics for Hallucination Assessment in Large Language Models | | 0 |
| Setting Standards in Turkish NLP: TR-MMLU for Large Language Model Evaluation | | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | | 0 |
| MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models | Code | 0 |
| Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs | | 0 |
| SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity | | 0 |
| HindiLLM: Large Language Model for Hindi | | 0 |
| Using Large Language Models for Automated Grading of Student Writing about Science | | 0 |
| In Case You Missed It: ARC 'Challenge' Is Not That Challenging | | 0 |
| MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | Code | 2 |
| LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | Code | 5 |
| Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation | | 0 |
| CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding | | 0 |
| Auto-bidding in Real-time Auctions via Oracle Imitation Learning (OIL) | | 0 |
| Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models | | 0 |
| MedG-KRP: Medical Graph Knowledge Representation Probing | Code | 0 |
| Do LLMs Act as Repositories of Causal Knowledge? | | 0 |
| A Recent Evaluation on the Performance of LLMs on Radiation Oncology Physics Using Questions of Randomly Shuffled Options | | 0 |
| Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician | | 0 |
Page 5 of 23
