SOTAVerified

Benchmarking

Papers

Showing 291-300 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2 |
| Benchmarking Agentic Workflow Generation | Code | 2 |
| CoqPilot, a plugin for LLM-based generation of proofs | Code | 2 |
| Deep Visual Geo-localization Benchmark | Code | 2 |
| EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2 |
| BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models | Code | 2 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |