Towards the Worst-case Robustness of Large Language Models
Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu
Abstract
Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, showing that most current deterministic defenses achieve nearly 0% worst-case robustness. We propose a general, tight lower bound for randomized smoothing based on fractional knapsack solvers or 0-1 knapsack solvers, and use it to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing with a uniform kernel, against any possible attack with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.
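As a rough illustration of the fractional-knapsack certification mentioned above, the sketch below computes a worst-case lower bound on the smoothed probability of the target class under a shifted smoothing distribution. This is a minimal sketch, not the paper's implementation: the region representation `regions`, the function name `certified_lower_bound`, and the toy numbers are assumptions introduced here for exposition.

```python
def certified_lower_bound(regions, p_a):
    """Fractional-knapsack lower bound on the smoothed probability of the
    target class under a perturbed smoothing distribution (illustrative sketch).

    regions : list of (p_i, q_i) pairs, where p_i > 0 is the mass of region i
              under the clean smoothing distribution and q_i its mass under
              the perturbed one (hypothetical representation, not the paper's
              exact region construction).
    p_a     : lower confidence bound on the clean smoothed probability of the
              target class.
    """
    # Worst case: the base classifier returns the target class only on regions
    # with the smallest likelihood ratio q_i / p_i, until the clean mass p_a
    # is exhausted -- exactly a fractional knapsack.
    regions = sorted(regions, key=lambda r: r[1] / r[0])
    remaining, bound = p_a, 0.0
    for p_i, q_i in regions:
        take = min(remaining, p_i)      # clean mass assigned to the target class in this region
        bound += take * (q_i / p_i)     # corresponding mass under the perturbed distribution
        remaining -= take
        if remaining <= 0:
            break
    return bound


if __name__ == "__main__":
    # Toy example with three regions (hypothetical numbers).
    print(certified_lower_bound([(0.5, 0.2), (0.3, 0.3), (0.2, 0.5)], p_a=0.7))
```

The greedy over likelihood ratios solves the underlying linear program exactly; a 0-1 knapsack variant would instead constrain each region to be assigned in full, as referenced in the abstract.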