SOTAVerified

Benchmarking

Papers

Showing 351-400 of 5548 papers

Title | Status | Hype
TxPert: Leveraging Biochemical Relationships for Out-of-Distribution Transcriptomic Perturbation Prediction | Code | 1
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | | 0
Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach | | 0
LLM-based Evaluation Policy Extraction for Ecological Modeling | | 0
SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis | Code | 0
Benchmarking the Myopic Trap: Positional Bias in Information Retrieval | Code | 5
A Data-Driven Method to Identify IBRs with Dominant Participation in Sub-Synchronous Oscillations | | 0
TransBench: Benchmarking Machine Translation for Industrial-Scale Applications | | 0
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | | 0
SlangDIT: Benchmarking LLMs in Interpretative Slang Translation | | 0
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | | 0
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking | Code | 3
Benchmarking data encoding methods in Quantum Machine Learning | | 0
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models | Code | 1
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | | 0
NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation | | 0
SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference | | 0
HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity | Code | 0
Benchmarking MOEAs for solving continuous multi-objective RL problems | Code | 0
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | | 0
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | | 0
PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI | | 0
A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | | 0
Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | | 0
Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities | Code | 1
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | Code | 0
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents | Code | 1
Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | | 0
What are they talking about? Benchmarking Large Language Models for Knowledge-Grounded Discussion Summarization | Code | 1
OSS-Bench: Benchmark Generator for Coding LLMs | Code | 0
Disambiguation in Conversational Question Answering in the Era of LLM: A Survey | | 0
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models | | 0
CompBench: Benchmarking Complex Instruction-guided Image Editing | | 0
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks | Code | 1
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | | 0
GlobalGeoTree: A Multi-Granular Vision-Language Dataset for Global Tree Species Classification | Code | 2
Machine Learning-Based Analysis of ECG and PCG Signals for Rheumatic Heart Disease Detection: A Scoping Review (2015-2025) | | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0
LOVE: Benchmarking and Evaluating Text-to-Video Generation and Video-to-Text Interpretation | Code | 1
GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | | 0
SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | Code | 0
Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges | Code | 0
Benchmarking CFAR and CNN-based Peak Detection Algorithms in ISAC under Hardware Impairments | | 0
Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale | | 0
ASR-FAIRBENCH: Measuring and Benchmarking Equity Across Speech Recognition Systems | | 0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0
HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | Code | 0
MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified