SOTAVerified

Benchmarking

Papers

Showing 411–420 of 5548 papers

Title	Date	Tasks	Status	Hype
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging	Jun 6, 2025	Benchmarking	Code Available	1
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark	Jun 5, 2025	Benchmarking	Code Available	1
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents	Jun 4, 2025	Benchmarking, Domain Adaptation	Code Available	1
Rethinking Machine Unlearning in Image Generation Models	Jun 3, 2025	Benchmarking, Image Generation	Code Available	1
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions	Jun 3, 2025	Benchmarking, Diversity	Code Available	1
NetPress: Dynamically Generated LLM Benchmarks for Network Applications	Jun 3, 2025	Benchmarking	Code Available	1
CODEMENV: Benchmarking Large Language Models on Code Migration	Jun 1, 2025	Benchmarking	Code Available	1
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time	May 31, 2025	Benchmarking, Test-time Adaptation	Code Available	1
ByzFL: Research Framework for Robust Federated Learning	May 30, 2025	Benchmarking, Federated Learning	Code Available	1
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation	May 30, 2025	All, Benchmarking	Code Available	1

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	GPT-4 Turbo	ACC	0.56	—	Unverified