SOTAVerified

Benchmarking

Papers

Showing 301-325 of 5548 papers

Title | Status | Hype
A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Code | 2
Exponentially Faster Language Modelling | Code | 2
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Code | 2
FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Code | 2
FedGraph: A Research Library and Benchmark for Federated Graph Learning | Code | 2
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | Code | 2
Foundational Models Defining a New Era in Vision: A Survey and Outlook | Code | 2
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Code | 2
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Code | 2
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Code | 2
GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Code | 2
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models | Code | 2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
AiTLAS: Artificial Intelligence Toolbox for Earth Observation | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2
HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified