SOTAVerified

Multiple-choice

Papers

Showing 76-100 of 1107 papers

Title | Status | Hype
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1
Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | Code | 1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1
Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Code | 1
Language Model Uncertainty Quantification with Attention Chain | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Code | 1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1
ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
Unifying Specialized Visual Encoders for Video Language Models | Code | 1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Code | 1
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages | Code | 1
Page 4 of 45

No leaderboard results yet.