SOTAVerified

Safety Alignment

Papers

Showing 1–25 of 288 papers

| Title | Status | Hype |
| --- | --- | --- |
| Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3 |
| The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Code | 3 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2 |
| The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Code | 2 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2 |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Code | 2 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2 |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2 |
| STAIR: Improving Safety Alignment with Introspective Reasoning | Code | 2 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2 |
| Cross-Modality Safety Alignment | Code | 2 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | Code | 2 |
| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2 |
| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | Code | 2 |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2 |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Code | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1 |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1 |
