| Title | Date | Tasks | Code | Stars |
|---|---|---|---|---|
| AI Awareness | Apr 25, 2025 | Safety Alignment | Unverified | 0 |
| AI Alignment at Your Discretion | Feb 10, 2025 | Safety Alignment | Unverified | 0 |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Jul 5, 2024 | Code Completion, Question Answering | Unverified | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling | Unverified | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment | Unverified | 0 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | May 27, 2024 | Safety Alignment | Unverified | 0 |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment | Unverified | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment | Unverified | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | Unverified | 0 |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | Jul 6, 2025 | Safety Alignment | Unverified | 0 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | May 16, 2025 | Adversarial Robustness, Safety Alignment | Unverified | 0 |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | Feb 21, 2025 | Safety Alignment | Unverified | 0 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Feb 7, 2024 | Safety Alignment | Unverified | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | Unverified | 0 |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nov 16, 2023 | Safety Alignment | Unverified | 0 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment | Unverified | 0 |
| Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | Mar 3, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | Nov 4, 2024 | Cross-Lingual Transfer, Language Acquisition | Unverified | 0 |
| Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Jul 10, 2024 | Safety Alignment | Unverified | 0 |
| Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | Dec 10, 2024 | Safety Alignment | Unverified | 0 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial Attack, Safety Alignment | Unverified | 0 |
| Off-Policy Risk Assessment in Markov Decision Processes | Sep 21, 2022 | Multi-Armed Bandits, Safety Alignment | Unverified | 0 |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | May 22, 2025 | Safety Alignment | Unverified | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | Unverified | 0 |