SOTAVerified

Benchmarking

Papers

Showing 1076-1100 of 5548 papers

Title | Status | Hype
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
Collective Knowledge: organizing research projects as a database of reusable components and portable workflows with common APIs | Code | 1
A Review and Efficient Implementation of Scene Graph Generation Metrics | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
2.5D Visual Relationship Detection | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
Benchmarking Large Multimodal Models against Common Corruptions | Code | 1
CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Code | 1
COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Code | 1
CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Code | 1
Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
CLoG: Benchmarking Continual Learning of Image Generation Models | Code | 1
Clinical Prompt Learning with Frozen Language Models | Code | 1
CODEMENV: Benchmarking Large Language Models on Code Migration | Code | 1
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified