SOTAVerified

Benchmarking

Papers

Showing 2431-2440 of 5548 papers

Title | Status | Hype
Do LLM Evaluators Prefer Themselves for a Reason? | Code | 0
Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Code | 0
Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Code | 0
Generalization and Regularization in DQN | Code | 0
Assigning Species Information to Corresponding Genes by a Sequence Labeling Framework | Code | 0
Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified