SOTAVerified

Benchmarking

Papers

Showing 1001-1050 of 5548 papers

Title | Status | Hype
Data Generating Process to Evaluate Causal Discovery Techniques for Time Series Data | Code | 1
OceanBench: The Sea Surface Height Edition | Code | 1
Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Code | 1
Working Memory Capacity of ChatGPT: An Empirical Study | Code | 1
Contemporary Symbolic Regression Methods and their Relative Performance | Code | 1
Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Code | 1
Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices | Code | 1
Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Code | 1
Benchmarking Simulation-Based Inference | Code | 1
Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK | Code | 1
Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Code | 1
A Comparison of Image Denoising Methods | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
AI Agents That Matter | Code | 1
Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | Code | 1
OpenDataVal: a Unified Benchmark for Data Valuation | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
AI Accelerator Survey and Trends | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
A skeletonization algorithm for gradient-based optimization | Code | 1
Benchmarking Visual Localization for Autonomous Navigation | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Code | 1
Coarse-to-Fine Q-attention with Learned Path Ranking | Code | 1
Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Code | 1
A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Code | 1
A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Code | 1
A global analysis of metrics used for measuring performance in natural language processing | Code | 1
Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
Page 21 of 111

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified