| MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge | Jun 5, 2024 | MedQA | Code Available | 0 | 5 |
| Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering | Mar 7, 2024 | Information Retrieval, Language Modelling | Code Available | 0 | 5 |
| Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks | Jun 17, 2024 | MedQA | Code Available | 0 | 5 |
| Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology | Apr 24, 2023 | Benchmarking, Decision Making | Code Available | 0 | 5 |
| TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification | May 23, 2025 | MedQA | Code Available | 0 | 5 |
| LM^2: A Simple Society of Language Models Solves Complex Reasoning | Apr 2, 2024 | Math, MedQA | Code Available | 0 | 5 |
| WiNGPT-3.0 Technical Report | May 23, 2025 | Diagnostic, MedQA | Code Available | 0 | 5 |
| IMAS: A Comprehensive Agentic Approach to Rural Healthcare Delivery | Oct 13, 2024 | MedQA | Code Available | 0 | 5 |
| MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering | Sep 27, 2023 | In-Context Learning, Medical Question Answering | —Unverified | 0 | 0 |
| MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering | Jun 3, 2024 | Medical Question Answering, MedQA | —Unverified | 0 | 0 |
| Medical Exam Question Answering with Large-scale Reading Comprehension | Feb 28, 2018 | MedQA, Question Answering | —Unverified | 0 | 0 |
| Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards | Jun 13, 2025 | Diagnostic, MedQA | —Unverified | 0 | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | —Unverified | 0 | 0 |
| MMDS: A Multimodal Medical Diagnosis System Integrating Image Analysis and Knowledge-based Departmental Consultation | Oct 20, 2024 | Emotion Recognition, Facial Emotion Recognition | —Unverified | 0 | 0 |
| OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | Feb 16, 2025 | MedQA, MMLU | —Unverified | 0 | 0 |
| OpenMedLM: Prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models | Feb 29, 2024 | Medical Question Answering, MedQA | —Unverified | 0 | 0 |
| Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble | May 29, 2025 | Decision Making, MedQA | —Unverified | 0 | 0 |
| SM70: A Large Language Model for Medical Devices | Dec 12, 2023 | Decision Making, Information Retrieval | —Unverified | 0 | 0 |
| Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework | Mar 7, 2025 | Conformal Prediction, Medical Question Answering | —Unverified | 0 | 0 |
| Superhuman performance in urology board questions by an explainable large language model enabled for context integration of the European Association of Urology guidelines: the UroBot study | Jun 3, 2024 | Chatbot, Language Modeling | —Unverified | 0 | 0 |
| Susceptibility of Large Language Models to User-Driven Factors in Medical Queries | Mar 26, 2025 | Diagnostic, MedQA | —Unverified | 0 | 0 |
| What Does Neuro Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs | May 15, 2025 | All, Benchmarking | —Unverified | 0 | 0 |
| Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond | Feb 22, 2024 | Form, Medical Question Answering | —Unverified | 0 | 0 |
| Reliable and diverse evaluation of LLM medical knowledge mastery | Sep 22, 2024 | Diversity, MedQA | —Unverified | 0 | 0 |
| Disentangling Reasoning and Knowledge in Medical Large Language Models | May 16, 2025 | Diagnostic, MedQA | —Unverified | 0 | 0 |