SOTAVerified

Benchmarking

Papers

Showing 351–375 of 5548 papers

Title | Status | Hype
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | - | 0
TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | Code | 1
LLM-based Evaluation Policy Extraction for Ecological Modeling | - | 0
Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach | - | 0
SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis | Code | 0
TransBench: Benchmarking Machine Translation for Industrial-Scale Applications | - | 0
A Data-Driven Method to Identify IBRs with Dominant Participation in Sub-Synchronous Oscillations | - | 0
SlangDIT: Benchmarking LLMs in Interpretative Slang Translation | - | 0
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | - | 0
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | - | 0
Benchmarking data encoding methods in Quantum Machine Learning | - | 0
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking | Code | 3
Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | Code | 5
NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation | - | 0
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | - | 0
SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference | - | 0
HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity | Code | 0
Benchmarking MOEAs for solving continuous multi-objective RL problems | Code | 0
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | - | 0
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | - | 0
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | - | 0
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | - | 0
A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | - | 0
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified