SOTAVerified

Benchmarking

Papers

Showing 2276–2300 of 5548 papers

Title | Status | Hype
Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | | 0
Verifiable Format Control for Large Language Model Generations | | 0
PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data | Code | 0
Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Code | 0
EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | | 0
Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials | | 0
Optimal PMU Placement for Kalman Filtering of DAE Power System Models | | 0
xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | | 0
Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications | | 0
TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics | Code | 0
MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0
LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | | 0
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Code | 0
Evalita-LLM: Benchmarking Large Language Models on Italian | | 0
A comparison of translation performance between DeepL and Supertext | Code | 0
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | | 0
Dynamic benchmarking framework for LLM-based conversational data capture | | 0
MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0
SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | | 0
EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools | | 0
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | | 0
Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Code | 0
True Online TD-Replan(lambda) Achieving Planning through Replaying | | 0
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0
Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | | 0
Page 92 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified