| Title | Date | Tasks | Code |
| --- | --- | --- | --- |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Oct 1, 2024 | Safety Alignment | Code Available |
| LLM Safety Alignment is Divergence Estimation in Disguise | Feb 2, 2025 | Language Modeling | Code Available |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Jun 3, 2025 | Prompt Engineering, Red Teaming | Code Available |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modeling | Code Available |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Jun 3, 2025 | Arithmetic Reasoning, Code Generation | Code Available |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Jun 12, 2025 | Question Answering, Safety Alignment | Code Available |
| Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Nov 5, 2024 | Quantization, Safety Alignment | Code Available |
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Feb 17, 2025 | Safety Alignment | Code Available |
| Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | Oct 23, 2024 | Instruction Following, Safety Alignment | Unverified |
| Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | Oct 5, 2024 | Language Modeling, Machine Translation | Unverified |
| Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Jul 7, 2025 | Image Generation, Safety Alignment | Unverified |
| Trustworthy AI: Safety, Bias, and Privacy -- A Survey | Feb 11, 2025 | Safety Alignment, Survey | Unverified |
| Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | Jan 23, 2025 | Safety Alignment | Unverified |
| TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | Jul 8, 2025 | Chatbot, Instruction Following | Unverified |
| Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | May 23, 2025 | Safety Alignment | Unverified |
| Understanding and Rectifying Safety Perception Distortion in VLMs | Feb 18, 2025 | Disentanglement, Safety Alignment | Unverified |
| Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | Nov 6, 2024 | Safety Alignment | Unverified |
| Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Oct 11, 2024 | Safety Alignment | Unverified |
| Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | Mar 10, 2025 | Binary Classification, Safety Alignment | Unverified |
| VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | Apr 17, 2025 | Multimodal Reasoning, Safety Alignment | Unverified |
| VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | Feb 14, 2025 | Attribute, Safety Alignment | Unverified |
| Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | Jun 4, 2025 | Safety Alignment | Unverified |
| Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | Feb 4, 2025 | Safety Alignment | Unverified |
| What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | Apr 28, 2025 | Attribute, Data Poisoning | Unverified |