Extracting Prompts by Inverting LLM Outputs

2024-05-23

Collin Zhang, John X. Morris, Vitaly Shmatikov

Code

github.com/collinzrj/output2prompt
Official · In paper · PyTorch · ★ 57

Abstract

We consider the problem of language model inversion: given outputs of a language model, we seek to extract the prompt that generated these outputs. We develop a new black-box method, output2prompt, that learns to extract prompts without access to the model's logits and without adversarial or jailbreaking queries. In contrast to previous work, output2prompt only needs outputs of normal user queries. To improve memory efficiency, output2prompt employs a new sparse encoding technique. We measure the efficacy of output2prompt on a variety of user and system prompts and demonstrate zero-shot transferability across different LLMs.

Tasks

Language Modeling
