| Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Jul 21, 2024 | Instruction FollowingLanguage Modelling | —Unverified | 0 | 0 |
| Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Jun 5, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | Feb 19, 2025 | Decision MakingSafety Alignment | —Unverified | 0 | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | May 22, 2024 | LLM JailbreakSafety Alignment | —Unverified | 0 | 0 |
| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Apr 15, 2025 | DiversityRed Teaming | —Unverified | 0 | 0 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial RobustnessSafety Alignment | —Unverified | 0 | 0 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARCMMLU | —Unverified | 0 | 0 |
| AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) | Jun 10, 2025 | Adversarial AttackSafety Alignment | —Unverified | 0 | 0 |
| Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | Jun 24, 2024 | Safety Alignment | —Unverified | 0 | 0 |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | —Unverified | 0 | 0 |