| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | Jun 24, 2024 | Benchmarking | —Unverified | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | —Unverified | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | Dec 2, 2024 | Benchmarking, Decision Making | —Unverified | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | —Unverified | 0 |
| MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | Apr 8, 2024 | Benchmarking, Medical Question Answering | —Unverified | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | May 12, 2023 | Benchmarking | —Unverified | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| MediaEval 2018: Predicting Media Memorability Task | Jul 3, 2018 | Benchmarking, Memorization | —Unverified | 0 |
| MedMeshCNN -- Enabling MeshCNN for Medical Surface Models | Sep 10, 2020 | Benchmarking, Segmentation | —Unverified | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | Feb 5, 2025 | Benchmarking, Scheduling | —Unverified | 0 |
| MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | Dec 5, 2024 | Benchmarking, Domain Generalization | —Unverified | 0 |
| MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks | Nov 13, 2023 | Benchmarking | —Unverified | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | Jun 4, 2025 | Benchmarking, Language Modelling | —Unverified | 0 |
| MeltpoolNet: Melt pool Characteristic Prediction in Metal Additive Manufacturing Using Machine Learning | Jan 26, 2022 | Articles, Benchmarking | —Unverified | 0 |
| MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | Jul 8, 2024 | Benchmarking, Deep Learning | —Unverified | 0 |
| Metaethical Perspectives on 'Benchmarking' AI Ethics | Apr 11, 2022 | Benchmarking, Ethics | —Unverified | 0 |
| Meta learning to classify intent and slot labels with noisy few shot examples | Nov 30, 2020 | Benchmarking, intent-classification | —Unverified | 0 |
| Metastatic Cancer Outcome Prediction with Injective Multiple Instance Pooling | Mar 9, 2022 | Benchmarking, Management | —Unverified | 0 |
| Methods and open-source toolkit for analyzing and visualizing challenge results | Oct 11, 2019 | Benchmarking | —Unverified | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | Benchmarking, DeepFake Detection | —Unverified | 0 |
| Metrics for Benchmarking and Uncertainty Quantification: Quality, Applicability, and a Path to Best Practices for Machine Learning in Chemistry | Sep 30, 2020 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | —Unverified | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | Mar 29, 2025 | Answer Generation, Benchmarking | —Unverified | 0 |
| Microtask crowdsourcing for disease mention annotation in PubMed abstracts | Aug 8, 2014 | Benchmarking | —Unverified | 0 |