| Title | Date | Tasks | Code |
| --- | --- | --- | --- |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Feb 13, 2025 | Safety Alignment | Code Available |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Apr 25, 2024 | Natural Language Inference, Safety Alignment | Code Available |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context Learning, Safety Alignment | Code Available |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Jan 5, 2024 | Safety Alignment | Code Available |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Nov 9, 2023 | Optical Character Recognition (OCR), Safety Alignment | Code Available |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Mar 1, 2025 | Language Modeling | Code Available |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Jul 10, 2023 | Question Answering, Safety Alignment | Code Available |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Feb 19, 2024 | Language Modeling | Code Available |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | May 22, 2025 | Safety Alignment | Code Available |