SOTAVerified

Safety Alignment

Papers

Showing 251-275 of 288 papers

Title | Status | Hype
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2
Enhancing Jailbreak Attacks with Diversity Guidance | - | 0
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | - | 0
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | - | 0
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Code | 1
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | - | 0
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Code | 0
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | - | 0
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | - | 0
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Code | 0
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | - | 0
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | - | 0
SuperHF: Supervised Iterative Learning from Human Feedback | Code | 1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | - | 0
Page 11 of 12

No leaderboard results yet.