AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

2024-02-15Code Available2· sign in to hype

Zhihao Fan, Jialong Tang, Wei Chen, Siyuan Wang, Zhongyu Wei, Jun Xi, Fei Huang, Jingren Zhou

Code Available — Be the first to reproduce this paper.

Code

github.com/LibertFan/AI_Hospital
OfficialIn papernone★ 189

Abstract

Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce AI Hospital, a multi-agent framework simulating dynamic medical interactions between Doctor as player and NPCs including Patient, Examiner, Chief Physician. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at https://github.com/LibertFan/AI_Hospital.

Tasks

Benchmarking Diagnostic Medical Question Answering Question Answering

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Code

Abstract

Tasks

Reproductions