| Overriding Safety Protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | Code Available | 0 | 5 |
| OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | May 27, 2025 | Safety Alignment | Code Available | 0 | 5 |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available | 0 | 5 |
| Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Feb 19, 2025 | Prompt Engineering, Safety Alignment | Code Available | 0 | 5 |
| Can a Large Language Model Be a Gaslighter? | Oct 11, 2024 | Language Modeling | Code Available | 0 | 5 |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Apr 8, 2024 | General Knowledge, Safety Alignment | Code Available | 0 | 5 |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available | 0 | 5 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Code Available | 0 | 5 |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | May 23, 2025 | Safety Alignment | Code Available | 0 | 5 |