| Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks | Jun 17, 2024 | MedQA | CodeCode Available | 0 |
| MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge | Jun 5, 2024 | MedQA | CodeCode Available | 0 |
| MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering | Jun 3, 2024 | Medical Question AnsweringMedQA | —Unverified | 0 |
| Superhuman performance in urology board questions by an explainable large language model enabled for context integration of the European Association of Urology guidelines: the UroBot study | Jun 3, 2024 | ChatbotLanguage Modeling | —Unverified | 0 |
| MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning | Jun 3, 2024 | DiagnosticMedQA | CodeCode Available | 1 |
| AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments | May 13, 2024 | Decision MakingDiagnostic | —Unverified | 0 |
| Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents | May 5, 2024 | MedQAQuestion Answering | —Unverified | 0 |
| Capabilities of Gemini Models in Medicine | Apr 29, 2024 | In-Context LearningMedQA | —Unverified | 0 |
| Assessing The Potential Of Mid-Sized Language Models For Clinical QA | Apr 24, 2024 | MedQAQuestion Answering | —Unverified | 0 |
| LM^2: A Simple Society of Language Models Solves Complex Reasoning | Apr 2, 2024 | MathMedQA | CodeCode Available | 0 |