A Systematic Study of Joint Representation Learning on Protein Sequences and Structures

2023-03-11

Zuobai Zhang, Chuanrui Wang, Minghao Xu, Vijil Chenthamarakshan, Aurélie Lozano, Payel Das, Jian Tang


Code

github.com/deepgraphlearning/gearnet
Official, in paper, PyTorch, ★ 317
github.com/deepgraphlearning/esm-gearnet
Official, in paper, PyTorch, ★ 111
github.com/deepgraphlearning/esm-s
PyTorch, ★ 38

Abstract

Learning effective protein representations is critical in a variety of tasks in biology, such as predicting protein functions. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods, which leverage 3D structural information through graph neural networks and geometric pre-training, show potential in function prediction tasks but still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.
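To make the idea of fusing sequence and structure representations concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not name the three fusion strategies, so the two variants shown are illustrative assumptions. Every function name here (`toy_sequence_encoder`, `contact_graph`, `toy_structure_encoder`) is hypothetical, the random embeddings merely stand in for PLM output such as ESM-2, and a single mean-aggregation layer stands in for a structure encoder such as GearNet.

```python
import numpy as np


def toy_sequence_encoder(seq_len, dim, rng):
    """Hypothetical stand-in for a PLM: one random embedding per residue."""
    return rng.standard_normal((seq_len, dim))


def contact_graph(coords, cutoff=8.0):
    """Connect residues whose coordinates lie within `cutoff` of each other."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    adj = (dists < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj


def toy_structure_encoder(node_feats, adjacency, weight):
    """One round of mean-aggregation message passing over the residue graph."""
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1.0)
    messages = adjacency @ node_feats / degree
    return np.tanh((node_feats + messages) @ weight)


rng = np.random.default_rng(0)
seq_len, dim = 6, 4
coords = rng.standard_normal((seq_len, 3)) * 5.0   # fake 3D residue positions
adj = contact_graph(coords)
seq_emb = toy_sequence_encoder(seq_len, dim, rng)
weight = rng.standard_normal((dim, dim)) * 0.1

# Variant A (assumed): sequence embeddings become the structure encoder's node features.
fused_as_input = toy_structure_encoder(seq_emb, adj, weight)

# Variant B (assumed): encode structure separately, then concatenate per residue.
struct_only = toy_structure_encoder(np.eye(seq_len, dim), adj, weight)
fused_concat = np.concatenate([seq_emb, struct_only], axis=-1)

# Mean-pool residues into one protein-level vector for a downstream classifier.
protein_vec = fused_as_input.mean(axis=0)
```

In variant A the structure encoder refines the sequence embeddings along graph edges, while variant B keeps the two views separate until the final concatenation. Which design works better is the kind of question the study examines empirically.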

Tasks

Contrastive Learning, Protein Function Prediction, Representation Learning
