SOTAVerified

Benchmarking

Papers

Showing 451-500 of 5548 papers

Title | Status | Hype
M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection | Code | 1
MatTools: Benchmarking Large Language Models for Materials Science Tools | Code | 1
Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests | Code | 1
Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally | Code | 1
Towards scalable surrogate models based on Neural Fields for large scale aerodynamic simulations | Code | 1
OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions | Code | 1
Benchmarking AI scientists in omics data-driven biological research | Code | 1
FNBench: Benchmarking Robust Federated Learning against Noisy Labels | Code | 1
JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | Code | 1
scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction | Code | 1
PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models | Code | 1
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Code | 1
RGB-Event Fusion with Self-Attention for Collision Prediction | Code | 1
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | Code | 1
Benchmarking LLMs' Swarm intelligence | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks | Code | 1
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Code | 1
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Code | 1
TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming | Code | 1
LEMUR Neural Network Dataset: Towards Seamless AutoML | Code | 1
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Code | 1
Evolutionary Generation of Random Surreal Numbers for Benchmarking | Code | 1
An Empirical Study of GPT-4o Image Generation Capabilities | Code | 1
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1
Generative Evaluation of Complex Reasoning in Large Language Models | Code | 1
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Code | 1
SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers | Code | 1
EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Code | 1
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios | Code | 1
The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Code | 1
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Code | 1
Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Code | 1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Code | 1
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Code | 1
QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs | Code | 1
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination | Code | 1
JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System | Code | 1
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Code | 1
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Code | 1
GNNs as Predictors of Agentic Workflow Performances | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified