| Title | Date | Tasks |
| --- | --- | --- |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming |
| From Evaluation to Defense: Advancing Safety in Video Large Language Models | May 22, 2025 | Safety Alignment |
| From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | Jun 11, 2025 | Safety Alignment |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | May 20, 2025 | Image Generation, Red Teaming |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | Feb 3, 2025 | Safety Alignment |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Oct 10, 2023 | In-Context Learning, Language Modeling |
| Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Jul 5, 2024 | Code Completion, Question Answering |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | Aug 30, 2023 | Decoder, Safety Alignment |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment |
| JULI: Jailbreak Large Language Models by Self-Introspection | May 17, 2025 | Safety Alignment |
| Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | Jul 6, 2025 | Safety Alignment |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Apr 3, 2024 | Prompt Engineering, Safety Alignment |
| Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | Mar 3, 2025 | Language Modeling |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | Apr 14, 2025 | Persuasion Strategies, Safety Alignment |
| LLM-Safety Evaluations Lack Robustness | Mar 4, 2025 | Red Teaming, Response Generation |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Feb 24, 2024 | Adversarial Attack, Safety Alignment |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Oct 31, 2023 | GPU, Red Teaming |
| LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | Jul 3, 2024 | Safety Alignment |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Nov 13, 2023 | Instruction Following, Red Teaming |
| Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | Jun 12, 2024 | Instruction Following, Safety Alignment |
| Model Card and Evaluations for Claude Models | Jul 11, 2023 | Arithmetic Reasoning, Bug Fixing |
| Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | Dec 11, 2024 | Model Editing, Safety Alignment |
| Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | Jun 20, 2024 | Model, Safety Alignment |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | Apr 3, 2025 | ARC, HellaSwag |
| Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | Jul 10, 2024 | Safety Alignment |
| Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | Dec 10, 2024 | Safety Alignment |
| NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | Apr 29, 2025 | Safety Alignment |
| No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | Dec 13, 2024 | In-Context Learning, Safety Alignment |
| Noise Injection Systemically Degrades Large Language Model Safety Guardrails | May 16, 2025 | Language Modeling |
| No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | May 25, 2024 | Safety Alignment |
| Off-Policy Risk Assessment in Markov Decision Processes | Sep 21, 2022 | Multi-Armed Bandits, Safety Alignment |
| One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | May 12, 2025 | Code Generation, Safety Alignment |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | Jun 4, 2024 | Question Answering, Safety Alignment |
| PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Sep 21, 2024 | Multi-agent Reinforcement Learning, Safety Alignment |
| PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | Nov 28, 2024 | Federated Learning, Parameter-Efficient Fine-Tuning |
| PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | Jun 20, 2024 | Question Answering, Safety Alignment |
| Playing Language Game with LLMs Leads to Jailbreaking | Nov 16, 2024 | Safety Alignment |
| PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | May 27, 2025 | Counterfactual, Diversity |
| Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | Aug 20, 2024 | Safety Alignment |
| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | Jan 7, 2025 | Image Generation, Safety Alignment |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Apr 14, 2025 | Safety Alignment |
| Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | Feb 8, 2025 | Safety Alignment |
| Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | Jun 9, 2025 | Safety Alignment |
| Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | Mar 13, 2025 | Language Modeling |
| Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | May 26, 2025 | Safety Alignment |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Nov 16, 2023 | Backdoor Attack, Data Poisoning |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | May 24, 2024 | Safety Alignment |
| SafeArena: Evaluating the Safety of Autonomous Web Agents | Mar 6, 2025 | Misinformation, Safety Alignment |
| SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | May 29, 2025 | Diagnostic, Red Teaming |
| SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | May 26, 2025 | Language Modeling |