SOTAVerified

Benchmarking

Papers

Showing 976-1000 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | | 0 |
| Verifiable Format Control for Large Language Model Generations | | 0 |
| Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | | 0 |
| LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset | | 0 |
| Large Language Models for Multi-Robot Systems: A Survey | Code | 1 |
| SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning | Code | 2 |
| Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples | Code | 0 |
| PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data | Code | 0 |
| Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications | | 0 |
| TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics | Code | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0 |
| Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation | Code | 2 |
| Optimal PMU Placement for Kalman Filtering of DAE Power System Models | | 0 |
| Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials | | 0 |
| PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design | Code | 1 |
| xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | | 0 |
| LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | | 0 |
| Dynamic benchmarking framework for LLM-based conversational data capture | | 0 |
| Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Code | 4 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | | 0 |
| Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | | 0 |
| A comparison of translation performance between DeepL and Supertext | Code | 0 |
| No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Code | 0 |
| Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |