SOTAVerified

Benchmarking

Papers

Showing 341–350 of 5548 papers

Title | Status | Hype
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
Benchmarking Benchmark Leakage in Large Language Models | Code | 2
Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
MINERVA: Evaluating Complex Video Reasoning | Code | 2
EasyTPP: Towards Open Benchmarking Temporal Point Processes | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified