SOTAVerified

Multiple-choice

Papers

Showing 51–100 of 1107 papers

Title | Status | Hype
Understanding Long Videos with Multimodal Language Models | Code | 2
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
tinyBenchmarks: evaluating LLMs with fewer examples | Code | 2
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2
Steering Llama 2 via Contrastive Activation Addition | Code | 2
Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2
SEED-Bench-2: Benchmarking Multimodal Large Language Models | Code | 2
GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Code | 2
SafetyBench: Evaluating the Safety of Large Language Models | Code | 2
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Code | 2
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2
MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization | Code | 2
Perception Test: A Diagnostic Benchmark for Multimodal Models | Code | 2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2
MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams | Code | 2
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Code | 1
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1
Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean | Code | 1
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Code | 1
Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework | Code | 1
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? | Code | 1
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing | Code | 1
Ranked Voting based Self-Consistency of Large Language Models | Code | 1
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1
Assessing the Chemical Intelligence of Large Language Models | Code | 1
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Code | 1
Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark | Code | 1
Language Model Uncertainty Quantification with Attention Chain | Code | 1
Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models | Code | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset | Code | 1
AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models | Code | 1
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models | Code | 1
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes | Code | 1
FaceXBench: Evaluating Multimodal LLMs on Face Understanding | Code | 1
ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind | Code | 1
ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian | Code | 1
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation | Code | 1
Unifying Specialized Visual Encoders for Video Language Models | Code | 1
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion | Code | 1
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? | Code | 1
SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages | Code | 1
Page 2 of 23