| Title | Date | Tasks | Code |
|---|---|---|---|
| Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Apr 18, 2024 | Safety Alignment | Available |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Feb 28, 2024 | GSM8K, Safety Alignment | Available |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Feb 22, 2024 | Backdoor Attack, Language Modelling | Available |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modelling | Available |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Feb 14, 2024 | Adversarial Robustness, Safety Alignment | Available |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Available |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Available |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR), Safety Alignment | Available |
| SuperHF: Supervised Iterative Learning from Human Feedback | Oct 25, 2023 | Language Modelling, Safety Alignment | Available |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Available |