| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | Unverified | 0 |
| SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Jun 18, 2024 | Safety Alignment | Code Available | 1 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16k, Language Modelling | Code Available | 0 |
| ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Jun 17, 2024 | Instruction Following, Safety Alignment | Code Available | 1 |
| Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Jun 17, 2024 | AI and Safety, Question Answering | Code Available | 1 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Jun 17, 2024 | Language Modelling | Code Available | 1 |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | Jun 15, 2024 | Federated Learning, Language Modelling | Unverified | 0 |
| Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | Jun 12, 2024 | Instruction Following, Safety Alignment | Unverified | 0 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Jun 10, 2024 | Safety Alignment | Code Available | 2 |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Jun 9, 2024 | Safety Alignment | Code Available | 2 |