SOTAVerified

Benchmarking

Papers

Showing 331-340 of 5548 papers

Title | Status | Hype
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2
Learning Transferable Visual Models From Natural Language Supervision | Code | 2
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA | Code | 2
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Code | 2
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Code | 2
CoqPilot, a plugin for LLM-based generation of proofs | Code | 2
Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Code | 2
Benchmarking Laparoscopic Surgical Image Restoration and Beyond | Code | 2
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified