SOTAVerified

Benchmarking

Papers

Showing 2451–2475 of 5548 papers

Title | Status | Hype
Strong and Simple Baselines for Multimodal Utterance Embeddings | Code | 0
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Code | 0
DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models | Code | 0
Benchmarking Large Language Models for Math Reasoning Tasks | Code | 0
Benchmarking Large Language Models for Image Classification of Marine Mammals | Code | 0
Flexible Generation of Preference Data for Recommendation Analysis | Code | 0
Divergent Creativity in Humans and Large Language Models | Code | 0
Local manifold learning and its link to domain-based physics knowledge | Code | 0
Distributional Depth-Based Estimation of Object Articulation Models | Code | 0
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation | Code | 0
A Framework for Generating Informative Benchmark Instances | Code | 0
GiantHunter: Accurate detection of giant virus in metagenomic data using reinforcement-learning and Monte Carlo tree search | Code | 0
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Code | 0
Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability | Code | 0
Generalization and Regularization in DQN | Code | 0
Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Code | 0
exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem | Code | 0
Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Code | 0
Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection | Code | 0
Experimental Analysis of Large-scale Learnable Vector Storage Compression | Code | 0
Benchmarking Large Language Models for Molecule Prediction Tasks | Code | 0
DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | Code | 0
Are Large Language Models Good at Utility Judgments? | Code | 0
DispaRisk: Auditing Fairness Through Usable Information | Code | 0
GenderBench: Evaluation Suite for Gender Biases in LLMs | Code | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified