| Methods and Trends in Detecting Generated Images: A Comprehensive Review | Feb 21, 2025 | Benchmarking, DeepFake Detection | —Unverified | 0 |
| Metrics for Benchmarking and Uncertainty Quantification: Quality, Applicability, and a Path to Best Practices for Machine Learning in Chemistry | Sep 30, 2020 | Benchmarking, BIG-bench Machine Learning | —Unverified | 0 |
| MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models | Feb 21, 2025 | Benchmarking, Diagnostic | —Unverified | 0 |
| MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation | Mar 29, 2025 | Answer Generation, Benchmarking | —Unverified | 0 |
| Microtask crowdsourcing for disease mention annotation in PubMed abstracts | Aug 8, 2014 | Benchmarking | —Unverified | 0 |
| Microvasculature Segmentation in Human BioMolecular Atlas Program (HuBMAP) | Aug 6, 2023 | Benchmarking, Image Segmentation | —Unverified | 0 |
| MileBench: Benchmarking MLLMs in Long Context | Apr 29, 2024 | Benchmarking, Diagnostic | —Unverified | 0 |
| MiLQ: Benchmarking IR Models for Bilingual Web Search with Mixed Language Queries | May 22, 2025 | Benchmarking, Information Retrieval | —Unverified | 0 |
| Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge | Jun 26, 2025 | Benchmarking | —Unverified | 0 |
| Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification | Feb 6, 2024 | Benchmarking, Multiple-choice | —Unverified | 0 |