| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available |
| How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Nov 15, 2023 | Ethics, Fairness | Code Available |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Dec 12, 2023 | Question Answering, Safety Alignment | Code Available |
| Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Oct 24, 2024 | Safety Alignment | Code Available |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Jun 21, 2025 | Safety Alignment | Code Available |
| Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Feb 16, 2025 | Safety Alignment | Code Available |
| AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | May 29, 2025 | Safety Alignment | Code Available |
| SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Jun 26, 2024 | Safety Alignment | Code Available |
| Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Feb 19, 2025 | Prompt Engineering, Safety Alignment | Code Available |
| Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Dec 15, 2024 | Safety Alignment | Code Available |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modeling | Code Available |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Feb 4, 2025 | Safety Alignment | Code Available |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Apr 8, 2024 | General Knowledge, Safety Alignment | Code Available |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available |
| OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | May 27, 2025 | Safety Alignment | Code Available |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Jun 3, 2025 | Prompt Engineering, Red Teaming | Code Available |
| Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Oct 25, 2024 | Safety Alignment | Code Available |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | May 23, 2025 | Safety Alignment | Code Available |
| Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | May 26, 2025 | Safety Alignment | Code Available |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Code Available |
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | Quantization, Safety Alignment | Code Available |
| Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Aug 14, 2024 | Safety Alignment | Code Available |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | May 29, 2024 | Safety Alignment | Code Available |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Apr 25, 2025 | Disentanglement, Safety Alignment | Code Available |