TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation

2024-12-19

Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, Qing Li

Code

github.com/phenixace/tomg-bench
Official · In paper · PyTorch · ★ 21

Abstract

In this paper, we propose the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses a dataset of three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each major task contains three subtasks, and each subtask comprises 5,000 test samples. Given the inherent complexity of evaluating open molecule generation, we also developed an automated evaluation system that measures both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations of text-guided molecule discovery as well as potential areas for improvement. Furthermore, we propose OpenMolIns, a specialized instruction tuning dataset built to address the challenges raised by TOMG-Bench. Fine-tuned on OpenMolIns, Llama3.1-8B could outperform all the open-source general LLMs, even surpassing GPT-3.5-turbo by 46.5% on TOMG-Bench. Our code and datasets are available at https://github.com/phenixace/TOMG-Bench.
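
The benchmark size implied by the abstract can be tallied with a short sketch. The task names come from the abstract; the subtask names are not listed on this page, so only their count is used, and the function and constant names are illustrative rather than taken from the official repository.

```python
# Task names as given in the abstract; subtasks are counted, not named,
# because this page does not list them.
TASKS = ("MolEdit", "MolOpt", "MolCustom")
SUBTASKS_PER_TASK = 3
SAMPLES_PER_SUBTASK = 5_000


def total_test_samples(tasks=TASKS,
                       subtasks_per_task=SUBTASKS_PER_TASK,
                       samples_per_subtask=SAMPLES_PER_SUBTASK):
    """Return the total number of test samples across all tasks."""
    return len(tasks) * subtasks_per_task * samples_per_subtask


print(total_test_samples())  # 45000
```

Under these figures the full benchmark holds 45,000 test samples, 15,000 per major task.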

Tasks

Benchmarking · Description-guided molecule generation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
TOMG-Bench	Claude-3.5	wAcc	35.92	—	Unverified
TOMG-Bench	Gemini-1.5-pro	wAcc	34.8	—	Unverified
TOMG-Bench	GPT-4-turbo	wAcc	34.23	—	Unverified
TOMG-Bench	GPT-4o	wAcc	32.29	—	Unverified
TOMG-Bench	Claude-3	wAcc	30.47	—	Unverified
TOMG-Bench	Llama-3.1-8B (OpenMolIns-large)	wAcc	27.22	—	Unverified
TOMG-Bench	Galactica-125M (OpenMolIns-xlarge)	wAcc	25.73	—	Unverified
TOMG-Bench	Llama3-70B-Instruct (INT4)	wAcc	23.93	—	Unverified
TOMG-Bench	Galactica-125M (OpenMolIns-large)	wAcc	23.42	—	Unverified
TOMG-Bench	Galactica-125M (OpenMolIns-medium)	wAcc	19.89	—	Unverified
TOMG-Bench	GPT-3.5-turbo	wAcc	18.58	—	Unverified
TOMG-Bench	Galactica-125M (OpenMolIns-small)	wAcc	15.18	—	Unverified
TOMG-Bench	Llama3.1-8B-Instruct	wAcc	14.09	—	Unverified
TOMG-Bench	Llama3-8B-Instruct	wAcc	13.75	—	Unverified
TOMG-Bench	chatglm-9B	wAcc	13.14	—	Unverified
TOMG-Bench	Galactica-125M (OpenMolIns-light)	wAcc	13.14	—	Unverified
TOMG-Bench	Llama3.2-1B (OpenMolIns-large)	wAcc	8.1	—	Unverified
TOMG-Bench	yi-1.5-9B	wAcc	7.32	—	Unverified
TOMG-Bench	Mistral-7B-Instruct-v0.2	wAcc	4.81	—	Unverified
TOMG-Bench	BioT5-base	wAcc	4.21	—	Unverified
TOMG-Bench	MolT5-large	wAcc	2.89	—	Unverified
TOMG-Bench	Llama-3.1-1B-Instruct	wAcc	1.99	—	Unverified
TOMG-Bench	MolT5-base	wAcc	1.3	—	Unverified
TOMG-Bench	MolT5-small	wAcc	1.3	—	Unverified
TOMG-Bench	Qwen2-7B-Instruct	wAcc	0.15	—	Unverified
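
The 46.5% figure quoted in the abstract appears to be a relative gain in wAcc rather than a difference in percentage points. A minimal check, assuming it compares the Llama-3.1-8B (OpenMolIns-large) row with the GPT-3.5-turbo row of the table above (the helper name is illustrative):

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


# wAcc values copied from the results table above.
llama_openmolins_large = 27.22
gpt_35_turbo = 18.58

gain = relative_improvement(llama_openmolins_large, gpt_35_turbo)
print(f"{gain:.1f}%")  # 46.5%
```

The absolute gap between the two rows is 8.64 wAcc points; dividing by the GPT-3.5-turbo baseline of 18.58 reproduces the quoted 46.5%.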
