SOTAVerified

Benchmarking

Papers

Showing 1251-1275 of 5548 papers

Title | Status | Hype
Benchmarking Language Model Creativity: A Case Study on Code Generation | Code | 1
Benchmarking Robustness of 3D Object Detection to Common Corruptions | Code | 1
Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks | Code | 1
Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Code | 1
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking | Code | 1
GraphGallery: A Platform for Fast Benchmarking and Easy Development of Graph Neural Networks Based Intelligent Software | Code | 1
Graph Robustness Benchmark: Benchmarking the Adversarial Robustness of Graph Machine Learning | Code | 1
Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus | Code | 1
Combinatorial Optimization with Policy Adaptation using Latent Space Search | Code | 1
Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation | Code | 1
A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges | Code | 1
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions | Code | 1
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | Code | 1
CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Code | 1
CoDEx: A Comprehensive Knowledge Graph Completion Benchmark | Code | 1
Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents | Code | 1
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate | Code | 1
Benchmarking Quantized Neural Networks on FPGAs with FINN | Code | 1
CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Code | 1
Towards Heterogeneous Long-tailed Learning: Benchmarking, Metrics, and Toolbox | Code | 1
A framework for benchmarking clustering algorithms | Code | 1
CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Code | 1
New Protocols and Negative Results for Textual Entailment Data Collection | Code | 1
HINT3: Raising the bar for Intent Detection in the Wild | Code | 1
CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified