SOTAVerified

Benchmarking

Papers

Showing 1451–1475 of 5548 papers

| Title | Status | Hype |
| --- | --- | --- |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Code | 2 |
| Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios | Code | 0 |
| RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | Code | 2 |
| Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | | 0 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Code | 1 |
| A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | | 0 |
| Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | | 0 |
| SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation | Code | 2 |
| FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Code | 0 |
| IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning | Code | 2 |
| Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review | | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | | 0 |
| Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Code | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Code | 1 |
| Benchmarking Transcriptomics Foundation Models for Perturbation Analysis: one PCA still rules them all | Code | 1 |
| Sum Secrecy Rate Maximization for Full Duplex ISAC Systems | | 0 |
| UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Code | 0 |
| Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p | Code | 0 |
| debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Code | 0 |
| ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization | Code | 0 |
| Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Code | 0 |
| Trust but Verify: Programmatic VLM Evaluation in the Wild | | 0 |
| Configurable Embodied Data Generation for Class-Agnostic RGB-D Video Segmentation | | 0 |
| Understanding the Role of LLMs in Multimodal Evaluation Benchmarks | Code | 0 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Code | 1 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |