SOTAVerified

Benchmarking

Papers

Showing 3351-3375 of 5548 papers

| Title | Status | Hype |
| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0 |
| MediaEval 2018: Predicting Media Memorability Task | | 0 |
| MedMeshCNN -- Enabling MeshCNN for Medical Surface Models | | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0 |
| MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | | 0 |
| MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks | | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | | 0 |
| MeltpoolNet: Melt pool Characteristic Prediction in Metal Additive Manufacturing Using Machine Learning | | 0 |
| MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | | 0 |
| Metaethical Perspectives on 'Benchmarking' AI Ethics | | 0 |
| Meta learning to classify intent and slot labels with noisy few shot examples | | 0 |
| Metastatic Cancer Outcome Prediction with Injective Multiple Instance Pooling | | 0 |
| Methods and open-source toolkit for analyzing and visualizing challenge results | | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | | 0 |
| Metrics for Benchmarking and Uncertainty Quantification: Quality, Applicability, and a Path to Best Practices for Machine Learning in Chemistry | | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | | 0 |
| Microtask crowdsourcing for disease mention annotation in PubMed abstracts | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |