SOTAVerified

Benchmarking

Papers

Showing 3526-3550 of 5548 papers

Title | Status | Hype
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | | 0
MechProNet: Machine Learning Prediction of Mechanical Properties in Metal Additive Manufacturing | | 0
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | | 0
Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | | 0
Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | | 0
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | | 0
MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | | 0
Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | | 0
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | | 0
What can 5.17 billion regression fits tell us about artificial models of the human visual system? | | 0
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | | 0
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | | 0
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | | 0
Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | | 0
MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0
MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | | 0
Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | | 0
MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | | 0
MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0
MediaEval 2018: Predicting Media Memorability Task | | 0
Benchmarking Large Language Models for Handwritten Text Recognition | | 0
MedMeshCNN -- Enabling MeshCNN for Medical Surface Models | | 0
Benchmarking large language models for materials synthesis: the case of atomic layer deposition | | 0
Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | | 0
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0
Page 142 of 222

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | GPT-4 Turbo | ACC | 0.56 | | Unverified