SOTAVerified

Benchmarking

Papers

Showing 401–450 of 5548 papers

Title | Status | Hype
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Code | 1
GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Code | 1
The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Code | 1
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Code | 1
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Code | 1
Rethinking Machine Unlearning in Image Generation Models | Code | 1
NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Code | 1
ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | Code | 1
Bench4KE: Benchmarking Automated Competency Question Generation | Code | 1
Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | Code | 1
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Toward Memory-Aided World Models: Benchmarking via Spatial Consistency | Code | 1
Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking | Code | 1
SVRPBench: A Realistic Benchmark for Stochastic Vehicle Routing Problem | Code | 1
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking | Code | 1
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Code | 1
Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization | Code | 1
FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Code | 1
Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | Code | 1
OB3D: A New Dataset for Benchmarking Omnidirectional 3D Reconstruction Using Blender | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | Code | 1
SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning | Code | 1
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | Code | 1
Semantic Correspondence: Unified Benchmarking and a Strong Baseline | Code | 1
Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions | Code | 1
Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | Code | 1
REOBench: Benchmarking Robustness of Earth Observation Foundation Models | Code | 1
EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | Code | 1
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | Code | 1
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering | Code | 1
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | Code | 1
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | Code | 1
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | Code | 1
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | Code | 1
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | Code | 1
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Code | 1
Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins -- Dataset and Benchmarks | Code | 1
Page 9 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified
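The Claimed/Verified/Status columns above imply a simple reconciliation rule: a result stays Unverified until a reproduction run produces a score, which is then compared against the claimed number within some tolerance. A minimal sketch of that logic, assuming a `BenchmarkResult` record, a `verification_status` function, and a 0.01 tolerance that are all illustrative choices, not SOTAVerified's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BenchmarkResult:
    model: str
    metric: str          # e.g. "ACC"
    claimed: float       # score reported in the paper
    verified: Optional[float] = None  # None until a reproduction run exists

def verification_status(result: BenchmarkResult, tolerance: float = 0.01) -> str:
    """Classify a result by comparing the claimed score to a reproduced one."""
    if result.verified is None:
        return "Unverified"
    if abs(result.claimed - result.verified) <= tolerance:
        return "Verified"
    return "Mismatch"

# The single row from the table above: no reproduction yet, so Unverified.
row = BenchmarkResult(model="GPT-4 Turbo", metric="ACC", claimed=0.56)
print(verification_status(row))  # -> Unverified
```

Keeping `verified` optional rather than defaulting it to 0.0 makes "no reproduction yet" distinguishable from "reproduced a score of zero", which is what lets the empty Verified cell above map cleanly to the Unverified status.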