SOTAVerified

Multiple-choice

Papers

Showing 51–75 of 1107 papers

Title | Status | Hype
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | Code | 2
All in One: Exploring Unified Video-Language Pre-training | Code | 2
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance | Code | 2
Steering Llama 2 via Contrastive Activation Addition | Code | 2
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models | Code | 2
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
Towards Evaluating and Building Versatile Large Language Models for Medicine | Code | 2
Mellow: a small audio language model for reasoning | Code | 2
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding | Code | 2
BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology | Code | 2
What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams | Code | 2
Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering | Code | 2
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation | Code | 2
HourVideo: 1-Hour Video-Language Understanding | Code | 2
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Code | 2
FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models | Code | 2
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | Code | 2
GPQA: A Graduate-Level Google-Proof Q&A Benchmark | Code | 2
Perception Test: A Diagnostic Benchmark for Multimodal Models | Code | 2
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding | Code | 1
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models | Code | 1
Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset and Comprehensive Framework | Code | 1

No leaderboard results yet.