| Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat | May 26, 2025 | Benchmarking, Question Answering | —Unverified | 0 | 0 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | Feb 10, 2025 | Benchmarking, In-Context Learning | —Unverified | 0 | 0 |
| MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | Feb 26, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Matrix-Free Preconditioning in Online Learning | May 29, 2019 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking Large Language Model Volatility | Nov 26, 2023 | Benchmarking, Decision Making | —Unverified | 0 | 0 |
| Benchmarking Large Language Models with Integer Sequence Generation Tasks | Nov 7, 2024 | Benchmarking, Computational Efficiency | —Unverified | 0 | 0 |
| Maximum Categorical Cross Entropy (MCCE): A noise-robust alternative loss function to mitigate racial bias in Convolutional Neural Networks (CNNs) by reducing overfitting | Jan 1, 2021 | Benchmarking, General Classification | —Unverified | 0 | 0 |
| MaxpoolNMS: Getting Rid of NMS Bottlenecks in Two-Stage Object Detectors | Jun 1, 2019 | Benchmarking, General Classification | —Unverified | 0 | 0 |
| Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting | Jun 9, 2025 | Benchmarking, Decision Making | —Unverified | 0 | 0 |
| MBA-VO: Motion Blur Aware Visual Odometry | Mar 25, 2021 | Benchmarking, Visual Odometry | —Unverified | 0 | 0 |
| Towards Class-agnostic Tracking Using Feature Decorrelation in Point Clouds | Feb 28, 2022 | Benchmarking, Object Tracking | —Unverified | 0 | 0 |
| MCDFN: Supply Chain Demand Forecasting via an Explainable Multi-Channel Data Fusion Network Model | May 24, 2024 | Benchmarking, Demand Forecasting | —Unverified | 0 | 0 |
| MCL-3D: a database for stereoscopic image quality assessment using 2D-image-plus-depth source | Mar 23, 2014 | Benchmarking, Image Quality Assessment | —Unverified | 0 | 0 |
| Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction | Oct 8, 2023 | Benchmarking, Decoder | —Unverified | 0 | 0 |
| MCUBench: A Benchmark of Tiny Object Detectors on MCUs | Sep 27, 2024 | Benchmarking, Model Selection | —Unverified | 0 | 0 |
| MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification | May 29, 2024 | Benchmarking | —Unverified | 0 | 0 |
| MDR-DeePC: Model-Inspired Distributionally Robust Data-Enabled Predictive Control | Jun 24, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking Large Language Models via Random Variables | Jan 20, 2025 | Benchmarking, Mathematical Reasoning | —Unverified | 0 | 0 |
| Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language | Jun 25, 2024 | Benchmarking | —Unverified | 0 | 0 |
| Measuring CLEVRness: Black-box Testing of Visual Reasoning Models | Sep 29, 2021 | Benchmarking, Diagnostic | —Unverified | 0 | 0 |
| Measuring CLEVRness: Blackbox testing of Visual Reasoning Models | Feb 24, 2022 | Benchmarking, Diagnostic | —Unverified | 0 | 0 |
| Measuring Large Language Models Capacity to Annotate Journalistic Sourcing | Dec 30, 2024 | Benchmarking, Ethics | —Unverified | 0 | 0 |
| Measuring the Complexity of Domains Used to Evaluate AI Systems | Sep 18, 2020 | Benchmarking | —Unverified | 0 | 0 |
| Measuring the Effect of Causal Disentanglement on the Adversarial Robustness of Neural Network Models | Aug 21, 2023 | Adversarial Robustness, Benchmarking | —Unverified | 0 | 0 |
| Towards Effective Disambiguation for Machine Translation with Large Language Models | Sep 20, 2023 | Benchmarking, In-Context Learning | —Unverified | 0 | 0 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | Benchmarking, Question Answering | —Unverified | 0 | 0 |
| MechProNet: Machine Learning Prediction of Mechanical Properties in Metal Additive Manufacturing | Aug 21, 2022 | Articles, Benchmarking | —Unverified | 0 | 0 |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | May 22, 2025 | Benchmarking, Language Modeling | —Unverified | 0 | 0 |
| Benchmarking Large Language Models on Homework Assessment in Circuit Analysis | Jun 5, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs | Jan 26, 2024 | Benchmarking, Knowledge Graphs | —Unverified | 0 | 0 |
| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | May 30, 2025 | Benchmarking, Cryptanalysis | —Unverified | 0 | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | Jun 4, 2025 | Benchmarking, Language Modeling | —Unverified | 0 | 0 |
| Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | May 25, 2025 | Benchmarking | —Unverified | 0 | 0 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | Jun 5, 2025 | Benchmarking, Emotion Recognition | —Unverified | 0 | 0 |
| What can 5.17 billion regression fits tell us about artificial models of the human visual system? | Oct 12, 2021 | Benchmarking | —Unverified | 0 | 0 |
| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | Jun 24, 2024 | Benchmarking | —Unverified | 0 | 0 |
| Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | Jun 6, 2025 | Benchmarking, Model Selection | —Unverified | 0 | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | Dec 2, 2024 | Benchmarking, Decision Making | —Unverified | 0 | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | —Unverified | 0 | 0 |
| MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering | Apr 8, 2024 | Benchmarking, Medical Question Answering | —Unverified | 0 | 0 |
| Knowledge-guided Contextual Gene Set Analysis Using Large Language Models | Jun 4, 2025 | Benchmarking | —Unverified | 0 | 0 |
| MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine | May 12, 2023 | Benchmarking | —Unverified | 0 | 0 |
| MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models | May 16, 2025 | Benchmarking, Decision Making | —Unverified | 0 | 0 |
| MediaEval 2018: Predicting Media Memorability Task | Jul 3, 2018 | Benchmarking, Memorization | —Unverified | 0 | 0 |
| Benchmarking Large Language Models for Handwritten Text Recognition | Mar 19, 2025 | Benchmarking, Handwritten Text Recognition | —Unverified | 0 | 0 |
| MedMeshCNN -- Enabling MeshCNN for Medical Surface Models | Sep 10, 2020 | Benchmarking, Segmentation | —Unverified | 0 | 0 |
| Benchmarking large language models for materials synthesis: the case of atomic layer deposition | Dec 13, 2024 | Benchmarking, Hallucination | —Unverified | 0 | 0 |
| Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents | Oct 1, 2024 | Benchmarking, Conversational Question Answering | —Unverified | 0 | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | —Unverified | 0 | 0 |