| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | Jun 24, 2024 | Benchmarking | —Unverified | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | —Unverified | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | Dec 2, 2024 | Benchmarking, Decision Making | —Unverified | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | —Unverified | 0 |
| MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | Apr 8, 2024 | Benchmarking, Medical Question Answering | —Unverified | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | May 12, 2023 | Benchmarking | —Unverified | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| MediaEval 2018: Predicting Media Memorability Task | Jul 3, 2018 | Benchmarking, Memorization | —Unverified | 0 |
| MedMeshCNN -- Enabling MeshCNN for Medical Surface Models | Sep 10, 2020 | Benchmarking, Segmentation | —Unverified | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | —Unverified | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | Feb 5, 2025 | Benchmarking, Scheduling | —Unverified | 0 |
| MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | Dec 5, 2024 | Benchmarking, Domain Generalization | —Unverified | 0 |
| MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks | Nov 13, 2023 | Benchmarking | —Unverified | 0 |
| MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP | Jun 4, 2025 | Benchmarking, Language Modelling | —Unverified | 0 |
| MeltpoolNet: Melt pool Characteristic Prediction in Metal Additive Manufacturing Using Machine Learning | Jan 26, 2022 | Articles, Benchmarking | —Unverified | 0 |
| MERGE -- A Bimodal Audio-Lyrics Dataset for Static Music Emotion Recognition | Jul 8, 2024 | Benchmarking, Deep Learning | —Unverified | 0 |
| Metaethical Perspectives on 'Benchmarking' AI Ethics | Apr 11, 2022 | Benchmarking, Ethics | —Unverified | 0 |
| Meta learning to classify intent and slot labels with noisy few shot examples | Nov 30, 2020 | Benchmarking, intent-classification | —Unverified | 0 |
| Metastatic Cancer Outcome Prediction with Injective Multiple Instance Pooling | Mar 9, 2022 | Benchmarking, Management | —Unverified | 0 |
| Methods and open-source toolkit for analyzing and visualizing challenge results | Oct 11, 2019 | Benchmarking | —Unverified | 0 |
| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | Benchmarking, DeepFake Detection | —Unverified | 0 |
| Metrics for Benchmarking and Uncertainty Quantification: Quality, Applicability, and a Path to Best Practices for Machine Learning in Chemistry | Sep 30, 2020 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | —Unverified | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | Mar 29, 2025 | Answer Generation, Benchmarking | —Unverified | 0 |
| Microtask crowdsourcing for disease mention annotation in PubMed abstracts | Aug 8, 2014 | Benchmarking | —Unverified | 0 |
| Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP) | Aug 6, 2023 | Benchmarking, Image Segmentation | —Unverified | 0 |
| MileBench: Benchmarking MLLMs in Long Context | Apr 29, 2024 | Benchmarking, Diagnostic | —Unverified | 0 |
| MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | May 22, 2025 | Benchmarking, Information Retrieval | —Unverified | 0 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | Jun 26, 2025 | Benchmarking | —Unverified | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | Feb 6, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 |
| Mind the Retrosynthesis Gap: Bridging the divide between Single-step and Multi-step Retrosynthesis Prediction | Dec 12, 2022 | Benchmarking, Multi-step retrosynthesis | —Unverified | 0 |
| Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning | Dec 18, 2024 | Benchmarking, Position | —Unverified | 0 |
| MIRAI: Evaluating LLM Agents for Event Forecasting | Jul 1, 2024 | Articles, Benchmarking | —Unverified | 0 |
| MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning? | Feb 14, 2025 | Benchmarking, In-Context Learning | —Unverified | 0 |
| Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic | Jun 27, 2021 | Benchmarking, Feature Compression | —Unverified | 0 |
| Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices | Nov 29, 2023 | Benchmarking, Federated Learning | —Unverified | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | —Unverified | 0 |
| MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | Jul 14, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| MLHarness: A Scalable Benchmarking System for MLCommons | Nov 9, 2021 | Benchmarking | —Unverified | 0 |
| MLModelScope: A Distributed Platform for ML Model Evaluation and Benchmarking at Scale | Sep 25, 2019 | Benchmarking | —Unverified | 0 |
| MLModelScope: A Distributed Platform for Model Evaluation and Benchmarking at Scale | Feb 19, 2020 | Benchmarking | —Unverified | 0 |
| MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems | Oct 21, 2021 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| mlr3proba: An R Package for Machine Learning in Survival Analysis | Aug 18, 2020 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | Jun 12, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | —Unverified | 0 |
| MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding | Oct 25, 2024 | Benchmarking, document understanding | —Unverified | 0 |
| MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents | Jan 15, 2025 | Benchmarking, Optical Character Recognition (OCR) | —Unverified | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | —Unverified | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | Apr 4, 2025 | Benchmarking, Image Generation | —Unverified | 0 |
| MMInA: Benchmarking Multihop Multimodal Internet Agents | Apr 15, 2024 | Benchmarking | —Unverified | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | —Unverified | 0 |