SOTAVerified

Safety Alignment

Papers

Showing 201–225 of 288 papers

Title | Status | Hype
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | - | 0
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | - | 0
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | - | 0
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | - | 0
SafeWorld: Geo-Diverse Safety Alignment | Code | 0
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | - | 0
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | - | 0
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | - | 0
Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Code | 0
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | - | 0
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | - | 0
Playing Language Game with LLMs Leads to Jailbreaking | - | 0
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | - | 0
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Code | 0
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | - | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Smaller Large Language Models Can Do Moral Self-Correction | - | 0
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Code | 0
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Code | 0
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | - | 0
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Code | 0
SPIN: Self-Supervised Prompt INjection | - | 0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
Can a large language model be a gaslighter? | Code | 0
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | - | 0
Page 9 of 12

No leaderboard results yet.