| Title | Date | Tags |
| --- | --- | --- |
| SPIN: Self-Supervised Prompt INjection | Oct 17, 2024 | Safety Alignment |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | Apr 2, 2025 | Diversity, Safety Alignment |
| sudoLLM: On Multi-role Alignment of Language Models | May 20, 2025 | Language Modeling |
| Superficial Safety Alignment Hypothesis | Oct 7, 2024 | Attribute, Binary Classification |
| Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Oct 16, 2023 | Adversarial Attack, Federated Learning |
| The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | Feb 3, 2025 | Safety Alignment |
| The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | Feb 24, 2025 | Safety Alignment |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | Apr 18, 2025 | Safety Alignment |
| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | May 22, 2024 | Safety Alignment |
| Towards Inference-time Category-wise Safety Steering for Large Language Models | Oct 2, 2024 | Safety Alignment |
| Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | Apr 19, 2025 | Contrastive Learning, Image Generation |
| Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | Jan 27, 2025 | Language Modeling |