SOTAVerified

Benchmarking

Papers

Showing 2901-2925 of 5548 papers

Title | Status | Hype
Profit: Benchmarking Personalization and Robustness Trade-off in Federated Prompt Tuning | - | 0
CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis | - | 0
Bringing Quantum Algorithms to Automated Machine Learning: A Systematic Review of AutoML Frameworks Regarding Extensibility for QML Algorithms | - | 0
A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling | - | 0
PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Code | 1
Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report | - | 0
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Code | 2
Deep Reinforcement Learning Algorithms for Hybrid V2X Communication: A Benchmarking Study | - | 0
Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Code | 1
Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Code | 0
From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference | - | 0
On the Performance of Multimodal Language Models | - | 0
T^3Bench: Benchmarking Current Progress in Text-to-3D Generation | Code | 3
PGDQN: Preference-Guided Deep Q-Network | Code | 1
CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Code | 1
EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations | - | 0
EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | - | 0
Benchmarking and Improving Generator-Validator Consistency of Language Models | - | 0
GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Code | 1
Learning Quantum Processes with Quantum Statistical Queries | Code | 0
Adaptive Visual Scene Understanding: Incremental Scene Graph Generation | Code | 0
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1
A New Real-World Video Dataset for the Comparison of Defogging Algorithms | - | 0
NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Code | 1
TRAM: Benchmarking Temporal Reasoning for Large Language Models | - | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified