SOTAVerified

Multiple-choice

Papers

Showing 101–150 of 1107 papers

Title | Status | Hype
InstructionBench: An Instructional Video Understanding Benchmark | - | 0
Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | - | 0
From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | - | 0
VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Code | 0
ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | - | 0
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Code | 2
Order Independence With Finetuning | - | 0
Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Code | 0
Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Code | 1
Language Model Uncertainty Quantification with Attention Chain | Code | 1
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | - | 0
Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | - | 0
SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | - | 0
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | Code | 0
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | - | 0
AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models | - | 0
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | - | 0
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | - | 0
How much do LLMs learn from negative examples? | Code | 0
LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Code | 0
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | - | 0
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | - | 0
The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | - | 0
SeqSAM: Autoregressive Multiple Hypothesis Prediction for Medical Image Segmentation using SAM | Code | 0
Mellow: a small audio language model for reasoning | Code | 2
Identity Lock: Locking API Fine-tuned LLMs With Identity-based Wake Words | - | 0
VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models | Code | 0
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations | - | 0
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | - | 0
SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Code | 0
Towards Conversational AI for Disease Management | - | 0
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Code | 1
Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | - | 0
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs | Code | 0
The impact of AI and peer feedback on research writing skills: a study using the CGScholar platform among Kazakhstani scholars | - | 0
Analogical Reasoning Inside Large Language Models: Concept Vectors and the Limits of Abstraction | Code | 0
Structured Outputs Enable General-Purpose LLMs to be Medical Experts | - | 0
When an LLM is apprehensive about its answers -- and when its uncertainty is justified | Code | 0
None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering | - | 0
MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | - | 0
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology | Code | 2
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants | Code | 0
Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning | - | 0
ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions | - | 0
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging | Code | 0
SECURA: Sigmoid-Enhanced CUR Decomposition with Uninterrupted Retention and Low-Rank Adaptation in Large Language Models | - | 0
DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning | - | 0
Reversal Blessing: Thinking Backward May Outpace Thinking Forward in Multi-choice Questions | - | 0
Page 3 of 23
