SOTAVerified

Safety Alignment

Papers

Showing 126–150 of 288 papers

| Title | Status | Hype |
| --- | --- | --- |
| Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0 |
| LLM Safety Alignment is Divergence Estimation in Disguise | Code | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0 |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0 |
| Can a large language model be a gaslighter? | Code | 0 |
| DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0 |
| Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0 |
| Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Code | 0 |
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Code | 0 |
| Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | — | 0 |
| Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | — | 0 |
| Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | — | 0 |
| Trustworthy AI: Safety, Bias, and Privacy -- A Survey | — | 0 |
| Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | — | 0 |
| TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | — | 0 |
| Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | — | 0 |
| Understanding and Rectifying Safety Perception Distortion in VLMs | — | 0 |
| Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | — | 0 |
| Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | — | 0 |
| Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | — | 0 |
| VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | — | 0 |
| VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | — | 0 |
| Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | — | 0 |
| Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | — | 0 |
| What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | — | 0 |
Page 6 of 12

No leaderboard results yet.