SOTAVerified

Benchmarking

Papers

Showing 3351–3400 of 5548 papers

| Title | Status | Hype |
|---|---|---|
| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | | 0 |
| MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | | 0 |
| MediaEval 2018: Predicting Media Memorability Task | | 0 |
| MedMeshCNN -- Enabling MeshCNN for Medical Surface Models | | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | | 0 |
| MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | | 0 |
| MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks | | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | | 0 |
| MeltpoolNet: Melt pool Characteristic Prediction in Metal Additive Manufacturing Using Machine Learning | | 0 |
| MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | | 0 |
| Metaethical Perspectives on 'Benchmarking' AI Ethics | | 0 |
| Meta learning to classify intent and slot labels with noisy few shot examples | | 0 |
| Metastatic Cancer Outcome Prediction with Injective Multiple Instance Pooling | | 0 |
| Methods and open-source toolkit for analyzing and visualizing challenge results | | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | | 0 |
| Metrics for Benchmarking and Uncertainty Quantification: Quality, Applicability, and a Path to Best Practices for Machine Learning in Chemistry | | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | | 0 |
| Microtask crowdsourcing for disease mention annotation in PubMed abstracts | | 0 |
| Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP) | | 0 |
| MileBench: Benchmarking MLLMs in Long Context | | 0 |
| MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | | 0 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | | 0 |
| Mind the Retrosynthesis Gap: Bridging the divide between Single-step and Multi-step Retrosynthesis Prediction | | 0 |
| Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | | 0 |
| MIRAI: Evaluating LLM Agents for Event Forecasting | | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | | 0 |
| Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic | | 0 |
| Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices | | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | | 0 |
| MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | | 0 |
| MLHarness: A Scalable Benchmarking System for MLCommons | | 0 |
| MLModelScope: A Distributed Platform for ML Model Evaluation and Benchmarking at Scale | | 0 |
| MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale | | 0 |
| MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems | | 0 |
| mlr3proba: An R Package for Machine Learning in Survival Analysis | | 0 |
| ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | | 0 |
| MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding | | 0 |
| MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents | | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | | 0 |
| MMInA: Benchmarking Multihop Multimodal Internet Agents | | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | | 0 |
Page 68 of 111

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | GPT-4 Turbo | ACC | 0.56 | | Unverified |