SOTAVerified

Benchmarking

Papers

Showing 976-1000 of 5548 papers

Title | Status | Hype
MNIST-C: A Robustness Benchmark for Computer Vision | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Code | 1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1
MONICA: Benchmarking on Long-tailed Medical Image Classification | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM | Code | 1
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1
3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | Code | 1
Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation | Code | 1
Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
MultiRes-NetVLAD: Augmenting Place Recognition Training with Low-Resolution Imagery | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Code | 1
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified