SOTAVerified

Benchmarking

Papers

Showing 676-700 of 5548 papers

Title | Status | Hype
ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models | Code | 1
A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
A Large-scale Comprehensive Dataset and Copy-overlap Aware Evaluation Protocol for Segment-level Video Copy Detection | Code | 1
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CharacterBench: Benchmarking Character Customization of Large Language Models | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Code | 1
CHILI: Chemically-Informed Large-scale Inorganic Nanomaterials Dataset for Advancing Graph Machine Learning | Code | 1
Attention, Please! Revisiting Attentive Probing for Masked Image Modeling | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1
CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Code | 1
Align and Distill: Unifying and Improving Domain Adaptive Object Detection | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization | Code | 1
Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking | Code | 1
Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Code | 1
Page 28 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified