SOTAVerified

Safety Alignment

Papers

Showing 176–200 of 288 papers

| Title | Status | Hype |
|---|---|---|
| Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | — | 0 |
| SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Code | 0 |
| Understanding and Rectifying Safety Perception Distortion in VLMs | — | 0 |
| DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | — | 0 |
| Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | — | 0 |
| StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Code | 0 |
| Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Code | 0 |
| VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | — | 0 |
| Trustworthy AI: Safety, Bias, and Privacy -- A Survey | — | 0 |
| AI Alignment at Your Discretion | — | 0 |
| Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | — | 0 |
| Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | — | 0 |
| PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0 |
| Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | — | 0 |
| The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | — | 0 |
| LLM Safety Alignment is Divergence Estimation in Disguise | Code | 0 |
| Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | — | 0 |
| Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | — | 0 |
| Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | — | 0 |
| Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Code | 0 |
| PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | — | 0 |
| SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | — | 0 |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models | — | 0 |
| SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Code | 0 |
| Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Code | 0 |
Page 8 of 12