SOTAVerified

Benchmarking

Papers

Showing 251-275 of 5548 papers

Title | Status | Hype
EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Code | 2
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Code | 2
Benchmarking Agentic Workflow Generation | Code | 2
OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems | Code | 2
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Code | 2
EvalGIM: A Library for Evaluating Generative Image Models | Code | 2
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Code | 2
EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Code | 2
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Code | 2
PyGraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips | Code | 2
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Code | 2
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Code | 2
BARS: Towards Open Benchmarking for Recommender Systems | Code | 2
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Code | 2
EasyTPP: Towards Open Benchmarking Temporal Point Processes | Code | 2
Fast Vision Transformers with HiLo Attention | Code | 2
HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond | Code | 2
Deep Visual Geo-localization Benchmark | Code | 2
AutoPenBench: Benchmarking Generative Agents for Penetration Testing | Code | 2
Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Code | 2
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Code | 2
DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Code | 2
Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Code | 2
Datasets and Benchmarks for Offline Safe Reinforcement Learning | Code | 2
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | Code | 2

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified