| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | Feb 17, 2025 | Safety Alignment | —Unverified | 0 |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | Apr 3, 2025 | Safety Alignment | —Unverified | 0 |
| EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | May 29, 2025 | Safety Alignment | —Unverified | 0 |
| Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Nov 27, 2024 | Image Generation, Safety Alignment | —Unverified | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking | —Unverified | 0 |
| FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | Feb 28, 2025 | Safety Alignment | —Unverified | 0 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | —Unverified | 0 |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | May 22, 2025 | Safety Alignment | —Unverified | 0 |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment | —Unverified | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming | —Unverified | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment | —Unverified | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modelling | —Unverified | 0 |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Jul 5, 2024 | Code Completion, Question Answering | —Unverified | 0 |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment | —Unverified | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment | —Unverified | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment | —Unverified | 0 |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | Jul 6, 2025 | Safety Alignment | —Unverified | 0 |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt Engineering, Safety Alignment | —Unverified | 0 |
| Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | Mar 3, 2025 | Language Modeling, Language Modelling | —Unverified | 0 |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | Apr 14, 2025 | Persuasion Strategies, Safety Alignment | —Unverified | 0 |
| LLM-Safety Evaluations Lack Robustness | Mar 4, 2025 | Red Teaming, Response Generation | —Unverified | 0 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial Attack, Safety Alignment | —Unverified | 0 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Oct 31, 2023 | GPU, Red Teaming | —Unverified | 0 |
| LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | Jul 3, 2024 | Safety Alignment | —Unverified | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Nov 13, 2023 | Instruction Following, Red Teaming | —Unverified | 0 |