| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | —Unverified | 0 | 0 |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | May 22, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image GenerationRed Teaming | —Unverified | 0 | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment | —Unverified | 0 | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling | —Unverified | 0 | 0 |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Jul 5, 2024 | Code Completion, Question Answering | —Unverified | 0 | 0 |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment | —Unverified | 0 | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment | —Unverified | 0 | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | —Unverified | 0 | 0 |