SOTAVerified

Benchmarking

Papers

Showing 1376-1400 of 5548 papers

Title | Status | Hype
CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Code | 1
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Code | 1
CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives (full version) | Code | 1
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient | Code | 1
MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts | Code | 1
Benchmarking the Robustness of Deep Neural Networks to Common Corruptions in Digital Pathology | Code | 1
DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1
Benchmarking Image Retrieval for Visual Localization | Code | 1
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection | Code | 1
ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Code | 1
MIGPerf: A Comprehensive Benchmark for Deep Learning Training and Inference Workloads on Multi-Instance GPUs | Code | 1
COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Code | 1
MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Code | 1
minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Code | 1
MLLM-DataEngine: An Iterative Refinement Approach for MLLM | Code | 1
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1
CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling | Code | 1
ByzFL: Research Framework for Robust Federated Learning | Code | 1
Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | Code | 1
CosPGD: an efficient white-box adversarial attack for pixel-wise prediction tasks | Code | 1
scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified