SOTAVerified

Benchmarking

Papers

Showing 101–150 of 5548 papers

Title | Status | Hype
Sum Rate Maximization for Pinching Antennas Assisted RSMA System With Multiple Waveguides | - | 0
OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics | - | 0
Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning | - | 0
SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis | Code | 2
Bench to the Future: A Pastcasting Benchmark for Forecasting Agents | - | 0
ICE-ID: A Novel Historical Census Data Benchmark Comparing NARS against LLMs, & a ML Ensemble on Longitudinal Identity Resolution | - | 0
ScholarSearch: Benchmarking Scholar Searching Ability of LLMs | - | 0
Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models | - | 0
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios | Code | 0
IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments | Code | 2
FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models | - | 0
A Manually Annotated Image-Caption Dataset for Detecting Children in the Wild | Code | 0
GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Code | 1
GRAIL: A Benchmark for GRaph ActIve Learning in Dynamic Sensing Environments | - | 0
Graph Attention-based Decentralized Actor-Critic for Dual-Objective Control of Multi-UAV Swarms | - | 0
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens | - | 0
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Code | 1
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP | - | 0
Solving excited states for long-range interacting trapped ions with neural networks | - | 0
Benchmarking Foundation Speech and Language Models for Alzheimer's Disease and Related Dementia Detection from Spontaneous Speech | - | 0
The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning | Code | 0
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis | - | 0
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding | - | 0
REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models | - | 0
HuSc3D: Human Sculpture dataset for 3D object reconstruction | Code | 0
RADAR: Benchmarking Language Models on Imperfect Tabular Data | Code | 1
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Code | 0
Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework | - | 0
GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors | - | 0
Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting | - | 0
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments | - | 0
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | - | 0
Can AI Validate Science? Benchmarking LLMs for Accurate Scientific Claim Evidence Reasoning | Code | 0
SOP-Bench: Complex Industrial SOPs for Evaluating LLM Agents | - | 0
How Far Are We from Optimal Reasoning Efficiency? | Code | 0
LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping | Code | 0
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures | - | 0
Benchmarking Misuse Mitigation Against Covert Adversaries | Code | 0
DeepFake Doctor: Diagnosing and Treating Audio-Video Fake Detection | - | 0
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | - | 0
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Code | 1
Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions | - | 0
MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Code | 0
MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Code | 1
Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | - | 0
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | - | 0
Refer to Anything with Vision-Language Prompts | - | 0
DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models | - | 0
Page 3 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified