| Benchmarking Generative Models on Computational Thinking Tests in Elementary Visual Programming | Jun 14, 2024 | Benchmarking, General Knowledge | —Unverified | 0 | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | —Unverified | 0 | 0 |
| MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | Apr 4, 2025 | Benchmarking, Image Generation | —Unverified | 0 | 0 |
| MMInA: Benchmarking Multihop Multimodal Internet Agents | Apr 15, 2024 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) | Jan 21, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking General-Purpose In-Context Learning | May 27, 2024 | Benchmarking, Decision Making | —Unverified | 0 | 0 |
| MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation | May 23, 2025 | Audio Generation, Benchmarking | —Unverified | 0 | 0 |
| MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks | May 22, 2025 | Benchmarking, Spatial Reasoning | —Unverified | 0 | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | —Unverified | 0 | 0 |
| MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines | Sep 19, 2024 | Benchmarking | —Unverified | 0 | 0 |