| From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents | Jun 18, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs | Jun 13, 2025 | Medical Question Answering, MedQA | Unverified | 0 |
| MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models | Jun 12, 2025 | Image Segmentation, Medical Diagnosis | Unverified | 0 |
| Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection | Jun 11, 2025 | Medical Question Answering, MedQA | Code Available | 0 |
| ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning | Jun 11, 2025 | Medical Question Answering, Question Answering | Code Available | 2 |
| ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases | May 30, 2025 | Medical Question Answering, Multiple-choice | Unverified | 0 |
| Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs | May 30, 2025 | Fact Checking, Hallucination | Unverified | 0 |
| MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering | May 29, 2025 | Medical Question Answering, Question Answering | Unverified | 0 |
| ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room | May 28, 2025 | Medical Question Answering, Question Answering | Unverified | 0 |
| AMQA: An Adversarial Dataset for Benchmarking Bias of LLMs in Medicine and Healthcare | May 26, 2025 | Benchmarking, Medical Diagnosis | Code Available | 0 |