SOTAVerified

Benchmarking

Papers

Showing 2626-2650 of 5548 papers

Title | Status | Hype
Conditional diffusions for amortized neural posterior estimation | Code | 0
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Code | 0
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | - | 0
FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage | - | 0
Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Code | 0
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | - | 0
Safe Load Balancing in Software-Defined-Networking | - | 0
Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies | - | 0
Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | - | 0
ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Code | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0
Building Conformal Prediction Intervals with Approximate Message Passing | Code | 0
Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios | Code | 0
Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification | - | 0
A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | - | 0
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | - | 0
Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | - | 0
FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Code | 0
Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review | - | 0
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | - | 0
Trust but Verify: Programmatic VLM Evaluation in the Wild | - | 0
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Code | 0
Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p | Code | 0
debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Code | 0
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified