SOTAVerified

Benchmarking

Papers

Showing 2926–2950 of 5548 papers

Title | Status | Hype
CoDBench: A Critical Evaluation of Data-driven Models for Continuous Dynamical Systems | - | 0
FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models | Code | 2
Adaptive Control of an Inverted Pendulum by a Reinforcement Learning-based LQR Method | - | 0
The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks | - | 0
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data | Code | 1
Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection | - | 0
FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Code | 1
Optimizing with Low Budgets: a Comparison on the Black-box Optimization Benchmarking Suite and OpenAI Gym | - | 0
Benchmarking Collaborative Learning Methods Cost-Effectiveness for Prostate Segmentation | - | 0
Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1
Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1
Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors | - | 0
A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater | - | 0
Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts | - | 0
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation | Code | 3
G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks | Code | 1
FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Code | 1
LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite | Code | 1
Revisiting Neural Program Smoothing for Fuzzing | Code | 1
Language Models as a Service: Overview of a New Paradigm and its Challenges | - | 0
LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2
The Trickle-down Impact of Reward (In-)consistency on RLHF | Code | 1
OceanBench: The Sea Surface Height Edition | Code | 1
Page 118 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified