| Title | Date | Tasks | Code | Stars |
| --- | --- | --- | --- | --- |
| Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Oct 25, 2024 | Safety Alignment | Code Available | 0 |
| One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | May 23, 2025 | Safety Alignment | Code Available | 0 |
| SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Jun 26, 2024 | Safety Alignment | Code Available | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available | 0 |
| Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Feb 16, 2025 | Safety Alignment | Code Available | 0 |
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | May 22, 2025 | Quantization, Safety Alignment | Code Available | 0 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Code Available | 0 |
| Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Jun 16, 2025 | Diversity, Model Editing | Code Available | 0 |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available | 0 |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Apr 25, 2025 | Disentanglement, Safety Alignment | Code Available | 0 |
| Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Jun 21, 2025 | Safety Alignment | Code Available | 0 |
| Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | May 26, 2025 | Safety Alignment | Code Available | 0 |
| Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | May 22, 2025 | Safety Alignment | Code Available | 0 |
| Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Nov 26, 2024 | Prompt Engineering, Safety Alignment | Code Available | 0 |
| A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Oct 17, 2024 | Language Modeling | Code Available | 0 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Oct 17, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Dec 12, 2023 | Question Answering, Safety Alignment | Code Available | 0 |
| LLM Safety Alignment is Divergence Estimation in Disguise | Feb 2, 2025 | Language Modeling | Code Available | 0 |
| Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Nov 5, 2024 | Quantization, Safety Alignment | Code Available | 0 |
| Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Jan 18, 2025 | Safety Alignment | Code Available | 0 |
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Feb 17, 2025 | Safety Alignment | Code Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Oct 31, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Oct 24, 2024 | Safety Alignment | Code Available | 0 |
| How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Nov 15, 2023 | Ethics, Fairness | Code Available | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Jun 3, 2025 | Arithmetic Reasoning, Code Generation | Code Available | 0 |
| VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | May 26, 2025 | Language Modeling | Code Available | 0 |
| SafeWorld: Geo-Diverse Safety Alignment | Dec 9, 2024 | Safety Alignment | Code Available | 0 |
| Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Oct 7, 2024 | Language Modeling | Code Available | 0 |
| Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Aug 21, 2024 | Safety Alignment | Code Available | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | Language Modeling | Code Available | 0 |
| SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Dec 19, 2024 | Language Modeling | Code Available | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Apr 28, 2025 | Red Teaming, Safety Alignment | Code Available | 0 |
| SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Feb 18, 2025 | Safety Alignment | Code Available | 0 |
| The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Jul 17, 2024 | Fairness, Safety Alignment | Code Available | 0 |
| TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | May 30, 2025 | Diversity, Language Modeling | Code Available | 0 |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modeling | Code Available | 0 |