SOTAVerified

Benchmarking

Papers

Showing 2411-2420 of 5548 papers

Title | Status | Hype
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
HATE-ITA: New Baselines for Hate Speech Detection in Italian | Code | 0
Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Code | 0
Evaluating Shallow and Deep Neural Networks for Network Intrusion Detection Systems in Cyber Security | Code | 0
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | Code | 0
Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0
GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | - | Unverified