SOTAVerified

Benchmarking

Papers

Showing 1776–1800 of 5548 papers

Title | Status | Hype
Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | | 0
A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | | 0
Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | | 0
NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | | 0
NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation | | 0
ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations | | 0
Benchmarking data encoding methods in Quantum Machine Learning | | 0
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | | 0
DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | | 0
Explaining Unreliable Perception in Automated Driving: A Fuzzy-based Monitoring Approach | | 0
TransBench: Benchmarking Machine Translation for Industrial-Scale Applications | | 0
A Data-Driven Method to Identify IBRs with Dominant Participation in Sub-Synchronous Oscillations | | 0
SlangDIT: Benchmarking LLMs in Interpretative Slang Translation | | 0
LLM-based Evaluation Policy Extraction for Ecological Modeling | | 0
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | | 0
SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis | Code | 0
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas | | 0
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning | Code | 0
LEXam: Benchmarking Legal Reasoning on 340 Law Exams | | 0
CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | | 0
Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings | | 0
Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference | | 0
SzCORE as a benchmark: report from the seizure detection challenge at the 2025 AI in Epilepsy and Neurological Disorders Conference | | 0
Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning | | 0
A Comprehensive Benchmarking Platform for Deep Generative Models in Molecular Design | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified