SOTAVerified

Benchmarking

Papers

Showing 476–500 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1 |
| Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1 |
| ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1 |
| Benchmarking Visual Localization for Autonomous Navigation | Code | 1 |
| CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs | Code | 1 |
| Chaos as an interpretable benchmark for forecasting and data-driven modelling | Code | 1 |
| CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions | Code | 1 |
| CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1 |
| CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Code | 1 |
| AnuraSet: A dataset for benchmarking Neotropical anuran calls identification in passive acoustic monitoring | Code | 1 |
| Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Code | 1 |
| Restore Anything Model via Efficient Degradation Adaptation | Code | 1 |
| CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer | Code | 1 |
| Curious Hierarchical Actor-Critic Reinforcement Learning | Code | 1 |
| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Code | 1 |
| DACBench: A Benchmark Library for Dynamic Algorithm Configuration | Code | 1 |
| CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms | Code | 1 |
| API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs | Code | 1 |
| Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Code | 1 |
| COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Code | 1 |
| Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1 |
| A Platform for the Biomedical Application of Large Language Models | Code | 1 |
| Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models | Code | 1 |
| Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk | Code | 1 |
| Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |