SOTAVerified

Benchmarking

Papers

Showing 441-450 of 5548 papers

Title | Status | Hype
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | Code | 1
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | Code | 1
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | Code | 1
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | Code | 1
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Code | 1
Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified