| MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports | May 16, 2025 | Diagnostic, Math | Code Available | 1 | 5 |
| Towards Expert-Level Medical Question Answering with Large Language Models | May 16, 2023 | Medical Question Answering, MedQA | Code Available | 1 | 5 |
| TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification | May 23, 2025 | MedQA | Code Available | 0 | 5 |
| LM^2: A Simple Society of Language Models Solves Complex Reasoning | Apr 2, 2024 | Math, MedQA | Code Available | 0 | 5 |
| MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge | Jun 5, 2024 | MedQA | Code Available | 0 | 5 |
| Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology | Apr 24, 2023 | Benchmarking, Decision Making | Code Available | 0 | 5 |
| Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection | Jun 11, 2025 | Medical Question Answering, MedQA | Code Available | 0 | 5 |
| Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering | Mar 7, 2024 | Information Retrieval, Language Modelling | Code Available | 0 | 5 |
| DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents | Mar 30, 2023 | Conversation Summarization, Language Modeling | Code Available | 0 | 5 |
| Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks | Jun 17, 2024 | MedQA | Code Available | 0 | 5 |