| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR), Safety Alignment | Code Available | 1 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Oct 31, 2023 | GPU, Red Teaming | Unverified | 0 |
| SuperHF: Supervised Iterative Learning from Human Feedback | Oct 25, 2023 | Language Modelling, Safety Alignment | Code Available | 1 |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Code Available | 1 |
| Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Oct 16, 2023 | Adversarial Attack, Federated Learning | Unverified | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling | Unverified | 0 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Oct 5, 2023 | Red Teaming, Safety Alignment | Code Available | 2 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | All, Language Modeling | Code Available | 1 |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Oct 4, 2023 | GPU, Safety Alignment | Unverified | 0 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Oct 2, 2023 | Benchmarking, Safety Alignment | Code Available | 1 |