SOTAVerified

Benchmarking

Papers

Showing 401–425 of 5548 papers

Title | Status | Hype
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Code | 1
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Code | 1
The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Code | 1
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Code | 1
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Code | 1
NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Code | 1
Rethinking Machine Unlearning in Image Generation Models | Code | 1
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | Code | 1
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Code | 1
Bench4KE: Benchmarking Automated Competency Question Generation | Code | 1
Toward Memory-Aided World Models: Benchmarking via Spatial Consistency | Code | 1
SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem | Code | 1
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Code | 1
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified