SOTAVerified

Safety Alignment

Papers

Showing 11–20 of 288 papers

Title | Status | Hype
Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2
Page 2 of 29

No leaderboard results yet.