SOTAVerified

Safety Alignment

Papers

Showing 176-200 of 288 papers

Title | Status | Hype
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Code | 1
Can a large language model be a gaslighter? | Code | 0
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | - | 0
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | - | 0
Superficial Safety Alignment Hypothesis | - | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | - | 0
LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | - | 0
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | - | 0
Towards Inference-time Category-wise Safety Steering for Large Language Models | - | 0
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3
Backtracking Improves Generation Safety | - | 0
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | - | 0
Mitigating Unsafe Feedback with Learning Constraints | - | 0
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Code | 0
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | - | 0
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | - | 0
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Code | 0
Page 8 of 12