| Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language | Jun 25, 2024 | Benchmarking | —Unverified | 0 |
| Measuring CLEVRness: Black-box Testing of Visual Reasoning Models | Sep 29, 2021 | Benchmarking, Diagnostic | —Unverified | 0 |
| Measuring CLEVRness: Blackbox testing of Visual Reasoning Models | Feb 24, 2022 | Benchmarking, Diagnostic | —Unverified | 0 |
| Measuring Large Language Models Capacity to Annotate Journalistic Sourcing | Dec 30, 2024 | Benchmarking, Ethics | —Unverified | 0 |
| Measuring the Complexity of Domains Used to Evaluate AI Systems | Sep 18, 2020 | Benchmarking | —Unverified | 0 |
| Measuring the Effect of Causal Disentanglement on the Adversarial Robustness of Neural Network Models | Aug 21, 2023 | Adversarial Robustness, Benchmarking | —Unverified | 0 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | Benchmarking, Question Answering | —Unverified | 0 |
| MechProNet: Machine Learning Prediction of Mechanical Properties in Metal Additive Manufacturing | Aug 21, 2022 | Articles, Benchmarking | —Unverified | 0 |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | May 22, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | Jun 4, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |