SOTAVerified

Benchmarking

Papers

Showing 401-425 of 5548 papers

Title | Status | Hype
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
Protein Structure Tokenization: Benchmarking and New Recipe | Code | 1
Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation | Code | 1
Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Code | 1
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1
Benchmarking Deep Graph Generative Models for Optimizing New Drug Molecules for COVID-19 | Code | 1
CIDEr: Consensus-based Image Description Evaluation | Code | 1
MC-Blur: A Comprehensive Benchmark for Image Deblurring | Code | 1
Benchmarking Data Science Agents | Code | 1
Benchmarking deep inverse models over time, and the neural-adjoint method | Code | 1
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Code | 1
CheXphoto: 10,000+ Photos and Transformations of Chest X-rays for Benchmarking Deep Learning Robustness | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1
Benchmarking Data-driven Surrogate Simulators for Artificial Electromagnetic Materials | Code | 1
CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Code | 1
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin | Code | 1
Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
Benchmarking Commonsense Knowledge Base Population with an Effective Evaluation Dataset | Code | 1
Accelerated and interpretable oblique random survival forests | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1
Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Code | 1
Benchmarking Chinese Text Recognition: Datasets, Baselines, and an Empirical Study | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified