SOTAVerified

Benchmarking

Papers

Showing 901–950 of 5548 papers

Title | Status | Hype
An Image Dataset for Benchmarking Recommender Systems with Raw Pixels | Code | 1
LEAF: A Benchmark for Federated Settings | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Code | 1
AD-LLM: Benchmarking Large Language Models for Anomaly Detection | Code | 1
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | Code | 1
An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models | Code | 1
Benchmarking Counterfactual Image Generation | Code | 1
AdsorbML: A Leap in Efficiency for Adsorption Energy Calculations using Generalizable Machine Learning Potentials | Code | 1
Less Is More: A Comparison of Active Learning Strategies for 3D Medical Image Segmentation | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Benchmarking Data-driven Surrogate Simulators for Artificial Electromagnetic Materials | Code | 1
A Survey of Pathology Foundation Model: Progress and Future Directions | Code | 1
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Code | 1
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Benchmarking Data Science Agents | Code | 1
Light Field Salient Object Detection: A Review and Benchmark | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Code | 1
A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care | Code | 1
MC-Blur: A Comprehensive Benchmark for Image Deblurring | Code | 1
AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Benchmarking Deep Graph Generative Models for Optimizing New Drug Molecules for COVID-19 | Code | 1
Benchmarking deep inverse models over time, and the neural-adjoint method | Code | 1
A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification | Code | 1
Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
Benchmarking Deep Learning Interpretability in Time Series Predictions | Code | 1
Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT | Code | 1
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Benchmarking Deep Models for Salient Object Detection | Code | 1
Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
High-Dimensional Inference in Bayesian Networks | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified