| Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Jul 21, 2024 | Instruction Following, Language Modelling | —Unverified | 0 | 0 |
| Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Jun 5, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | Feb 19, 2025 | Decision Making, Safety Alignment | —Unverified | 0 | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | May 22, 2024 | LLM Jailbreak, Safety Alignment | —Unverified | 0 | 0 |
| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Apr 15, 2025 | Diversity, Red Teaming | —Unverified | 0 | 0 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment | —Unverified | 0 | 0 |
| From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Jun 7, 2025 | ARC, MMLU | —Unverified | 0 | 0 |
| AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI) | Jun 10, 2025 | Adversarial Attack, Safety Alignment | —Unverified | 0 | 0 |
| Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | Jun 24, 2024 | Safety Alignment | —Unverified | 0 | 0 |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| AI Awareness | Apr 25, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| aiXamine: Simplified LLM Safety and Security | Apr 21, 2025 | 2k, Adversarial Robustness | —Unverified | 0 | 0 |
| Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | Mar 14, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | Jun 2, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | May 23, 2025 | Active Learning, Reinforcement Learning (RL) | —Unverified | 0 | 0 |
| Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | May 15, 2025 | Malware Detection, Safety Alignment | —Unverified | 0 | 0 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Feb 7, 2024 | Safety Alignment | —Unverified | 0 | 0 |
| Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | Feb 21, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Backtracking for Safety | Mar 11, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Backtracking Improves Generation Safety | Sep 22, 2024 | Language Modeling, Language Modelling | —Unverified | 0 | 0 |
| Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | May 30, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Feb 23, 2024 | Safety Alignment | —Unverified | 0 | 0 |
| C3AI: Crafting and Evaluating Constitutions for Constitutional AI | Feb 21, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | —Unverified | 0 | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | Apr 4, 2024 | Chatbot, Instruction Following | —Unverified | 0 | 0 |