SOTAVerified

Safety Alignment

Papers

Showing 51–100 of 288 papers

Title | Status | Hype
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability | Code | 1
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks | Code | 1
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Code | 1
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
sudo rm -rf agentic_security | Code | 1
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Code | 1
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Code | 1
SuperHF: Supervised Iterative Learning from Human Feedback | Code | 1
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
Lifelong Safety Alignment for Language Models | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
Bayesian scaling laws for in-context learning | Code | 1
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Code | 1
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Code | 1
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Code | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Code | 0
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
Can a large language model be a gaslighter? | Code | 0
Page 2 of 6

No leaderboard results yet.