
Jogging the Memory of Unlearned LLMs Through Targeted Relearning Attacks

2024-06-19 · Code Available

Shengyuan Hu, Yiwei Fu, Zhiwei Steven Wu, Virginia Smith


Abstract

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in LLMs. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning. For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful knowledge about bioweapons, and relearning general wiki information about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
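Below is a minimal sketch of the relearning step the abstract describes: fine-tuning an unlearned checkpoint on a small, loosely related public corpus and then probing whether the "forgotten" knowledge resurfaces. The checkpoint name, example texts, prompt, and hyperparameters are illustrative placeholders, not the paper's actual setup.

```python
# Hypothetical relearning-attack sketch: standard causal-LM fine-tuning on a
# tiny auxiliary dataset, followed by a probe query. All names/values are
# placeholders, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "unlearned-llm"  # hypothetical path to an unlearned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Small, loosely related public texts (e.g., wiki or medical articles).
relearn_texts = [
    "Public article text loosely related to the unlearned topic...",
    "Another short public passage on the same general subject...",
]
batch = tokenizer(relearn_texts, return_tensors="pt",
                  padding=True, truncation=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A few steps of ordinary next-token fine-tuning on the relearning set.
for step in range(10):
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Probe whether the supposedly unlearned content can now be elicited.
model.eval()
prompt = tokenizer("Question about the supposedly unlearned content:",
                   return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**prompt, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```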
