SOTAVerified

Benchmarking

Papers

Showing 2901–2950 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| Profit: Benchmarking Personalization and Robustness Trade-off in Federated Prompt Tuning | | 0 |
| CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis | | 0 |
| Bringing Quantum Algorithms to Automated Machine Learning: A Systematic Review of AutoML Frameworks Regarding Extensibility for QML Algorithms | | 0 |
| A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling | | 0 |
| PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Code | 1 |
| Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report | | 0 |
| MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Code | 2 |
| Deep Reinforcement Learning Algorithms for Hybrid V2X Communication: A Benchmarking Study | | 0 |
| Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Code | 1 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Code | 0 |
| From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference | | 0 |
| On the Performance of Multimodal Language Models | | 0 |
| T^3Bench: Benchmarking Current Progress in Text-to-3D Generation | Code | 3 |
| PGDQN: Preference-Guided Deep Q-Network | Code | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1 |
| EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations | | 0 |
| EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | | 0 |
| Benchmarking and Improving Generator-Validator Consistency of Language Models | | 0 |
| GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Code | 1 |
| Learning Quantum Processes with Quantum Statistical Queries | Code | 0 |
| Adaptive Visual Scene Understanding: Incremental Scene Graph Generation | Code | 0 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1 |
| A New Real-World Video Dataset for the Comparison of Defogging Algorithms | | 0 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1 |
| TRAM: Benchmarking Temporal Reasoning for Large Language Models | | 0 |
| CoDBench: A Critical Evaluation of Data-driven Models for Continuous Dynamical Systems | | 0 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Code | 1 |
| RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models | Code | 2 |
| Adaptive Control of an Inverted Pendulum by a Reinforcement Learning-based LQR Method | | 0 |
| The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks | | 0 |
| MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data | Code | 1 |
| Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection | | 0 |
| FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Code | 1 |
| Optimizing with Low Budgets: a Comparison on the Black-box Optimization Benchmarking Suite and OpenAI Gym | | 0 |
| Benchmarking Collaborative Learning Methods Cost-Effectiveness for Prostate Segmentation | | 0 |
| Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Code | 1 |
| Benchmarking Cognitive Biases in Large Language Models as Evaluators | Code | 1 |
| Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors | | 0 |
| A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater | | 0 |
| Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts | | 0 |
| SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation | Code | 3 |
| G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks | Code | 1 |
| FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Code | 1 |
| LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite | Code | 1 |
| Revisiting Neural Program Smoothing for Fuzzing | Code | 1 |
| Language Models as a Service: Overview of a New Paradigm and its Challenges | | 0 |
| LawBench: Benchmarking Legal Knowledge of Large Language Models | Code | 2 |
| GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Code | 2 |
| The Trickle-down Impact of Reward (In-)consistency on RLHF | Code | 1 |
| OceanBench: The Sea Surface Height Edition | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |