| Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Jun 17, 2024 | AI and SafetyQuestion Answering | CodeCode Available | 1 |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | CodeCode Available | 1 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Jul 10, 2023 | Question AnsweringSafety Alignment | CodeCode Available | 1 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety AlignmentVisual Reasoning | CodeCode Available | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language InferenceSafety Alignment | CodeCode Available | 1 |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context LearningSafety Alignment | CodeCode Available | 1 |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR)Safety Alignment | CodeCode Available | 1 |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and SafetyDiversity | CodeCode Available | 1 |