SOTAVerified

Benchmarking

Papers

Showing 411-420 of 5548 papers

Title | Status | Hype
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Code | 1
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Code | 1
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Code | 1
Rethinking Machine Unlearning in Image Generation Models | Code | 1
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Code | 1
NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | Code | 1
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified