SOTAVerified

Benchmarking

Papers

Showing 1026-1050 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1 |
| CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1 |
| CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1 |
| AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Code | 1 |
| CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1 |
| Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1 |
| A skeletonization algorithm for gradient-based optimization | Code | 1 |
| Benchmarking Visual Localization for Autonomous Navigation | Code | 1 |
| CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1 |
| COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Code | 1 |
| Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1 |
| A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Code | 1 |
| Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1 |
| Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Code | 1 |
| CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1 |
| CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1 |
| AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1 |
| A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Code | 1 |
| A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Code | 1 |
| A global analysis of metrics used for measuring performance in natural language processing | Code | 1 |
| Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces | Code | 1 |
| Clinical Prompt Learning with Frozen Language Models | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |