| Title | Date | Tags |
| --- | --- | --- |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Oct 31, 2023 | GPU, Red Teaming |
| Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Oct 16, 2023 | Adversarial Attack, Federated Learning |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | Oct 4, 2023 | GPU, Safety Alignment |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment |
| Deceptive Alignment Monitoring | Jul 20, 2023 | Safety Alignment |
| Model Card and Evaluations for Claude Models | Jul 11, 2023 | Arithmetic Reasoning, Bug Fixing |
| Off-Policy Risk Assessment in Markov Decision Processes | Sep 21, 2022 | Multi-Armed Bandits, Safety Alignment |