SOTAVerified

Benchmarking

Papers

Showing 341-350 of 5548 papers

Title | Status | Hype
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
COALA: A Practical and Vision-Centric Federated Learning Platform | Code | 2
LtU-ILI: An All-in-One Framework for Implicit Inference in Astrophysics and Cosmology | Code | 2
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Code | 2
Class-incremental Learning for Time Series: Benchmark and Evaluation | Code | 2
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Code | 2
MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning | Code | 2
ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified