| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Feb 16, 2025 | Safety Alignment | Code Available |
| Probing the Robustness of Large Language Models Safety to Latent Perturbations | Jun 19, 2025 | Diagnostic, Safety Alignment | Code Available |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Mar 24, 2025 | Position, Safety Alignment | Code Available |
| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | May 27, 2024 | Safety Alignment | Code Available |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Oct 5, 2023 | Language Modelling | Code Available |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Code Available |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Feb 22, 2024 | Backdoor Attack, Language Modelling | Code Available |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available |
| Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Jul 4, 2024 | Q-Learning, Reinforcement Learning | Code Available |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Feb 14, 2024 | Adversarial Robustness, Safety Alignment | Code Available |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Code Available |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Aug 18, 2023 | MMLU, Red Teaming | Code Available |
| PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | May 13, 2024 | Safety Alignment | Code Available |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Jul 10, 2023 | Question Answering, Safety Alignment | Code Available |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available |
| PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Dec 7, 2024 | Red Teaming, Safety Alignment | Code Available |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context Learning, Safety Alignment | Code Available |
| OR-Bench: An Over-Refusal Benchmark for Large Language Models | May 31, 2024 | Safety Alignment | Code Available |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR), Safety Alignment | Code Available |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language Modelling | Code Available |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modelling | Code Available |
| Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Jan 30, 2025 | Safety Alignment | Code Available |