SOTAVerified

Safety Alignment

Papers

Showing 51–100 of 288 papers

Title | Status | Hype
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Code | 1
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Course-Correction: Safety Alignment Using Synthetic Preferences | Code | 1
Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Code | 1
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Code | 1
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Code | 1
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Code | 1
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Code | 1
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1
SuperHF: Supervised Iterative Learning from Human Feedback | Code | 1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Code | 1
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | - | 0
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | - | 0
Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | - | 0
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | - | 0
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | - | 0
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | - | 0
Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Code | 0
SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | - | 0
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | - | 0
AdversariaL attacK sAfety aLIgnment (ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement - Introducing Adversarial Vulnerability Quality Index (AVQI) | - | 0
Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | - | 0
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | - | 0
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | - | 0
Page 2 of 6

No leaderboard results yet.