| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | DiagnosticMultiple-choice | CodeCode Available | 2 | 5 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | May 23, 2023 | DiagnosticGrounded Video Question Answering | CodeCode Available | 2 | 5 |
| Citrus: Leveraging Expert Cognitive Pathways in a Medical Language Model for Advanced Medical Decision Support | Feb 25, 2025 | Decision MakingDiagnostic | CodeCode Available | 2 | 5 |
| A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model | Jul 22, 2024 | Diagnosticwhole slide images | CodeCode Available | 2 | 5 |
| ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model | Apr 13, 2025 | DiagnosticLanguage Modeling | CodeCode Available | 2 | 5 |
| CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification | Feb 27, 2024 | ClassificationDiagnostic | CodeCode Available | 2 | 5 |
| CodeS: Towards Building Open-source Language Models for Text-to-SQL | Feb 26, 2024 | Data AugmentationDiagnostic | CodeCode Available | 2 | 5 |
| CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis | Jul 18, 2024 | Decision MakingDiagnostic | CodeCode Available | 2 | 5 |
| BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models | Sep 12, 2023 | DiagnosticNatural Language Understanding | CodeCode Available | 2 | 5 |
| ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Jun 19, 2024 | DiagnosticMultiple-choice | CodeCode Available | 2 | 5 |