SOTAVerified

Benchmarking

Papers

Showing 1071–1080 of 5548 papers

Title | Status | Hype
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Code | 1
3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified