SOTAVerified

Multiple-choice

Papers

Showing 51-60 of 1107 papers

Title | Status | Hype
Understanding Long Videos with Multimodal Language Models | Code | 2
ToMBench: Benchmarking Theory of Mind in Large Language Models | Code | 2
tinyBenchmarks: evaluating LLMs with fewer examples | Code | 2
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge | Code | 2
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models | Code | 2
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models | Code | 2
Steering Llama 2 via Contrastive Activation Addition | Code | 2
Biomedical knowledge graph-optimized prompt generation for large language models | Code | 2
SEED-Bench-2: Benchmarking Multimodal Large Language Models | Code | 2
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Code | 2

No leaderboard results yet.