SOTAVerified

Safety Alignment

Papers

Showing 1–50 of 288 papers

Title | Status | Hype
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Code | 3
PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | Code | 2
STAIR: Improving Safety Alignment with Introspective Reasoning | Code | 2
Cross-Modality Safety Alignment | Code | 2
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Code | 2
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Code | 2
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Code | 2
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | Code | 2
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Code | 2
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2
Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Code | 2
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Code | 2
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Code | 2
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Code | 2
LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Code | 2
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
Lifelong Safety Alignment for Language Models | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
Page 1 of 6

No leaderboard results yet.