SOTAVerified

Multiple-choice

Papers

Showing 51–75 of 1107 papers

| Title | Status | Hype |
| --- | --- | --- |
| Understanding Long Videos with Multimodal Language Models | Code | 2 |
| ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2 |
| tinyBenchmarks: evaluating LLMs with fewer examples | Code | 2 |
| CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2 |
| SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2 |
| Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2 |
| Steering Llama 2 via Contrastive Activation Addition | Code | 2 |
| Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2 |
| SEED-Bench-2: Benchmarking Multimodal Large Language Models | Code | 2 |
| GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Code | 2 |
| SafetyBench: Evaluating the Safety of Large Language Models | Code | 2 |
| The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Code | 2 |
| FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2 |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | Code | 2 |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Code | 2 |
| MQAG: Multiple-choice Question Answering and Generation for Assessing Information Consistency in Summarization | Code | 2 |
| Perception Test: A Diagnostic Benchmark for Multimodal Models | Code | 2 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2 |
| MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering | Code | 2 |
| All in One: Exploring Unified Video-Language Pre-training | Code | 2 |
| What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams | Code | 2 |
| STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Code | 1 |
| LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation | Code | 1 |
| Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean | Code | 1 |

No leaderboard results yet.