Automatic Annotation Augmentation Boosts Translation between Molecules and Natural Language

2025-02-10

Zhiqiang Zhong, Simon Sataa-Yu Larsen, Haoyu Guo, Tao Tang, Kuangyu Zhou, Davide Mottin

Code

github.com/zhiqiangzhongddu/la3

Abstract

Recent advancements in AI for biological research focus on integrating molecular data with natural language to accelerate drug discovery. However, the scarcity of high-quality annotations limits progress in this area. This paper introduces LA^3, a Language-based Automatic Annotation Augmentation framework that leverages large language models to augment existing datasets, thereby improving AI training. We demonstrate the effectiveness of LA^3 by creating an enhanced dataset, LaChEBI-20, where we systematically rewrite the annotations of molecules from an established dataset. These rewritten annotations preserve essential molecular information while providing more varied sentence structures and vocabulary. Using LaChEBI-20, we train LaMolT5 based on a benchmark architecture to learn the mapping between molecular representations and augmented annotations. Experimental results on text-based *de novo* molecule generation and molecule captioning demonstrate that LaMolT5 outperforms state-of-the-art models. Notably, incorporating LA^3 leads to improvements of up to 301% over the benchmark architecture. Furthermore, we validate the effectiveness of LA^3 in notable applications across *image*, *text*, and *graph* tasks, affirming its versatility and utility.
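
The augmentation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Pair` type, the `augment` function, the `n_rewrites` parameter, and the `toy_rewrite` stand-in for a large language model call are all hypothetical names introduced here. The sketch also assumes that the original annotation is kept alongside its rewrites, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    smiles: str   # molecular representation (SMILES string)
    caption: str  # natural-language annotation


def augment(pairs: List[Pair],
            rewrite: Callable[[str, int], List[str]],
            n_rewrites: int = 2) -> List[Pair]:
    """Keep each original pair and add rewritten captions for the same molecule.

    Assumption: originals are retained next to their rewrites.
    """
    out: List[Pair] = []
    for p in pairs:
        out.append(p)
        seen = {p.caption}
        for text in rewrite(p.caption, n_rewrites):
            text = text.strip()
            # Skip empty or duplicate rewrites so the dataset gains real variety.
            if text and text not in seen:
                seen.add(text)
                out.append(Pair(p.smiles, text))
    return out


def toy_rewrite(caption: str, n: int) -> List[str]:
    # Hypothetical placeholder for an LLM call; returns trivial template variants.
    templates = ["This molecule is described as follows: {c}",
                 "In other words: {c}"]
    return [t.format(c=caption) for t in templates[:n]]


data = [Pair("CCO", "Ethanol is a primary alcohol.")]
augmented = augment(data, toy_rewrite)
```

In practice `toy_rewrite` would be replaced by a prompt to a language model that is asked to preserve the molecular facts while varying sentence structure and vocabulary; the molecule side of each pair is left untouched.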

Tasks

Drug Discovery, Molecule Captioning, Sentence, Text-based de novo Molecule Generation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ChEBI-20	LaMolT5-Large	BLEU-2	60.2	—	Unverified
ChEBI-20	LaMolT5-Base	BLEU-2	57.4	—	Unverified
ChEBI-20	LaMolT5-Small	BLEU-2	53.9	—	Unverified
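
For readers unfamiliar with the metric in the table, the following Python sketch computes a sentence-level BLEU-2 score against a single reference, without smoothing. It is a simplified illustration, not the evaluation script used for the claimed numbers: the function names are introduced here, whitespace tokenization is assumed, and benchmark scores are usually computed at corpus level and, as an assumption about this table, reported on a 0-100 scale.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu2(reference: List[str], hypothesis: List[str]) -> float:
    """Sentence-level BLEU-2: geometric mean of clipped 1- and 2-gram
    precision, multiplied by the brevity penalty. Returns a value in [0, 1]."""
    if not hypothesis:
        return 0.0
    log_p = 0.0
    for n in (1, 2):
        hyp = ngrams(hypothesis, n)
        ref = ngrams(reference, n)
        total = sum(hyp.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_p += 0.5 * math.log(clipped / total)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_p)


ref = "the molecule is an alcohol".split()
hyp = "the molecule is an acid".split()
score = bleu2(ref, hyp)
```

Here the hypothesis matches 4 of 5 unigrams and 3 of 4 bigrams, so the score is the square root of 0.8 × 0.75 with no brevity penalty, since both sentences have the same length.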
