SOTAVerified

Safety Alignment

Papers

Showing 1–50 of 288 papers

Title | Status | Hype
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Code | 3
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2
LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Code | 2
PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | Code | 2
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2
Cross-Modality Safety Alignment | Code | 2
Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | Code | 2
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Code | 2
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
STAIR: Improving Safety Alignment with Introspective Reasoning | Code | 2
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
Page 1 of 6

No leaderboard results yet.