SOTAVerified

Benchmarking

Papers

Showing 451–475 of 5548 papers

Title | Status | Hype
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection | Code | 1
MatTools: Benchmarking Large Language Models for Materials Science Tools | Code | 1
Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests | Code | 1
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally | Code | 1
Towards scalable surrogate models based on Neural Fields for large scale aerodynamic simulations | Code | 1
OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels | Code | 1
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | Code | 1
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models | Code | 1
scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction | Code | 1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1
RGB-Event Fusion with Self-Attention for Collision Prediction | Code | 1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks | Code | 1
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Code | 1
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Code | 1
TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming | Code | 1
LEMUR Neural Network Dataset: Towards Seamless AutoML | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified