SOTAVerified

Safety Alignment

Papers

Showing 251–288 of 288 papers

| Title | Status | Hype |
|---|---|---|
| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2 |
| Enhancing Jailbreak Attacks with Diversity Guidance |  | 0 |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2 |
| LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper |  | 0 |
| Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement |  | 0 |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2 |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2 |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Code | 1 |
| Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications |  | 0 |
| Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2 |
| MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1 |
| Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Code | 0 |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models |  | 0 |
| Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking |  | 0 |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1 |
| How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Code | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming |  | 0 |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1 |
| LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B |  | 0 |
| SuperHF: Supervised Iterative Learning from Human Feedback | Code | 1 |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1 |
| Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks |  | 0 |
| Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations |  | 0 |
| Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1 |
| Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models |  | 0 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1 |
| All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1 |
| Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models |  | 0 |
| Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2 |
| Deceptive Alignment Monitoring |  | 0 |
| Model Card and Evaluations for Claude Models |  | 0 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1 |
| Off-Policy Risk Assessment in Markov Decision Processes |  | 0 |
Page 6 of 6

No leaderboard results yet.