| Perception Test: A Diagnostic Benchmark for Multimodal Models | Oct 19, 2022 | Diagnostic, Multiple-choice | Code Available | 2 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | May 23, 2023 | Diagnostic, Grounded Video Question Answering | Code Available | 2 |
| AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Feb 15, 2024 | Benchmarking, Diagnostic | Code Available | 2 |
| ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World | Jun 19, 2024 | Diagnostic, Multiple-choice | Code Available | 2 |
| Enhancing Diagnostic Accuracy in Rare and Common Fundus Diseases with a Knowledge-Rich Vision-Language Model | Jun 13, 2024 | Diagnostic, Image Retrieval | Code Available | 2 |
| CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification | Feb 27, 2024 | Classification, Diagnostic | Code Available | 2 |
| A Multimodal Vision Foundation Model for Clinical Dermatology | Oct 19, 2024 | Diagnostic, Lesion Segmentation | Code Available | 2 |
| ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model | Apr 13, 2025 | Diagnostic, Language Modeling | Code Available | 2 |
| CodeS: Towards Building Open-source Language Models for Text-to-SQL | Feb 26, 2024 | Data Augmentation, Diagnostic | Code Available | 2 |
| BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models | Sep 12, 2023 | Diagnostic, Natural Language Understanding | Code Available | 2 |