SOTAVerified

Benchmarking

Papers

Showing 1201–1225 of 5548 papers

Title | Status | Hype
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
Explainable Global Wildfire Prediction Models using Graph Neural Networks | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions | Code | 1
Failure Detection in Medical Image Classification: A Reality Check and Benchmarking Testbed | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Code | 1
Fast hyperboloid decision tree algorithms | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Code | 1
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Code | 1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1
Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Code | 1
FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Code | 1
Flames: Benchmarking Value Alignment of LLMs in Chinese | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
A Reinforcement Learning Environment for Multi-Service UAV-enabled Wireless Systems | Code | 1
FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified