SOTAVerified

Multiple-choice

Papers

Showing 101–125 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| InstructionBench: An Instructional Video Understanding Benchmark | | 0 |
| Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams | | 0 |
| From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models | | 0 |
| VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence | Code | 0 |
| ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning | | 0 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Code | 2 |
| Order Independence With Finetuning | | 0 |
| Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models | Code | 0 |
| Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Code | 1 |
| Language Model Uncertainty Quantification with Attention Chain | Code | 1 |
| Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering | | 0 |
| Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark | | 0 |
| SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia | | 0 |
| Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation | Code | 0 |
| Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1 |
| AutoDrive-QA- Automated Generation of Multiple-Choice Questions for Autonomous Driving Datasets Using Large Vision-Language Models | | 0 |
| CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models | | 0 |
| VisNumBench: Evaluating Number Sense of Multimodal Large Language Models | | 0 |
| FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding | | 0 |
| How much do LLMs learn from negative examples? | Code | 0 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1 |
| LEAVS: An LLM-based Labeler for Abdominal CT Supervision | Code | 0 |
| It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education | | 0 |
| Chat-TS: Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data | | 0 |
| The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory | | 0 |

No leaderboard results yet.