| CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering | Jan 30, 2025 | General Knowledge, Language Modeling | Unverified | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning | Jan 11, 2025 | Decision Making, Diagnostic | Code Available | 1 |
| LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models | Dec 31, 2024 | Medical Question Answering, MedQA | Unverified | 0 |
| AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset | Nov 23, 2024 | Language Modeling, Language Modelling | Unverified | 0 |
| MMDS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental Consultation | Oct 20, 2024 | Emotion Recognition, Facial Emotion Recognition | Unverified | 0 |
| IMAS: A Comprehensive Agentic Approach to Rural Healthcare Delivery | Oct 13, 2024 | MedQA | Code Available | 0 |
| MedMobile: A mobile-sized language model with expert-level clinical capabilities | Oct 11, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework | Oct 2, 2024 | Benchmarking, Instruction Following | Code Available | 1 |
| DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining | Sep 30, 2024 | Continual Pretraining, Domain Adaptation | Unverified | 0 |
| A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? | Sep 23, 2024 | Hallucination, MedQA | Unverified | 0 |
| Reliable and diverse evaluation of LLM medical knowledge mastery | Sep 22, 2024 | Diversity, MedQA | Unverified | 0 |
| Eir: Thai Medical Large Language Models | Sep 13, 2024 | Language Modelling, Large Language Model | Unverified | 0 |
| DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using Large Language Models | Sep 2, 2024 | Medical Diagnosis, MedQA | Unverified | 0 |
| Improving Retrieval-Augmented Generation in Medicine with Iterative Follow-up Questions | Aug 1, 2024 | Medical Question Answering, MedQA | Code Available | 4 |
| Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks | Jun 17, 2024 | MedQA | Code Available | 0 |
| MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge | Jun 5, 2024 | MedQA | Code Available | 0 |
| MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering | Jun 3, 2024 | Medical Question Answering, MedQA | Unverified | 0 |
| Superhuman performance in urology board questions by an explainable large language model enabled for context integration of the European Association of Urology guidelines: the UroBot study | Jun 3, 2024 | Chatbot, Language Modeling | Unverified | 0 |
| MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning | Jun 3, 2024 | Diagnostic, MedQA | Code Available | 1 |
| AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments | May 13, 2024 | Decision Making, Diagnostic | Unverified | 0 |
| Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents | May 5, 2024 | MedQA, Question Answering | Unverified | 0 |
| Capabilities of Gemini Models in Medicine | Apr 29, 2024 | In-Context Learning, MedQA | Unverified | 0 |
| Assessing The Potential Of Mid-Sized Language Models For Clinical QA | Apr 24, 2024 | MedQA, Question Answering | Unverified | 0 |
| LM^2: A Simple Society of Language Models Solves Complex Reasoning | Apr 2, 2024 | Math, MedQA | Code Available | 0 |