SOTAVerified

Multiple-choice

Papers

Showing 51-75 of 1107 papers

Title | Status | Hype
Neptune: The Long Orbit to Benchmarking Long Video Understanding | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
SafetyBench: Evaluating the Safety of Large Language Models | Code | 2
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants | Code | 2
Mellow: a small audio language model for reasoning | Code | 2
Towards Evaluating and Building Versatile Large Language Models for Medicine | Code | 2
CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models | Code | 2
VAU-R1: Advancing Video Anomaly Understanding via Reinforcement Fine-Tuning | Code | 2
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering | Code | 2
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Code | 2
Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology | Code | 2
MedS^3: Towards Medical Small Language Models with Self-Evolved Slow Thinking | Code | 2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2
HourVideo: 1-Hour Video-Language Understanding | Code | 2
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2
GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Code | 2
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Code | 2
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Code | 2
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity | Code | 2
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1
A BERT-based Distractor Generation Scheme with Multi-tasking and Negative Answer Training Strategies | Code | 1
BiMediX: Bilingual Medical Mixture of Experts LLM | Code | 1

No leaderboard results yet.