SOTAVerified

Safety Alignment

Papers

Showing 1–50 of 288 papers

Title | Status | Hype
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Code | 3
PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | Code | 2
STAIR: Improving Safety Alignment with Introspective Reasoning | Code | 2
Cross-Modality Safety Alignment | Code | 2
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | Code | 2
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2
Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Code | 2
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2
LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Code | 2
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
Lifelong Safety Alignment for Language Models | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
Page 1 of 6

No leaderboard results yet.