SOTAVerified

Safety Alignment

Papers

Showing 101–150 of 288 papers

Title | Status | Hype
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | - | 0
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | - | 0
LLM-Safety Evaluations Lack Robustness | - | 0
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | - | 0
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Code | 1
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks | Code | 1
FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | - | 0
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | - | 0
C3AI: Crafting and Evaluating Constitutions for Constitutional AI | - | 0
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | - | 0
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | - | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
Understanding and Rectifying Safety Perception Distortion in VLMs | - | 0
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Code | 0
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | - | 0
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Code | 0
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | - | 0
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Code | 0
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Code | 1
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability | Code | 1
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | - | 0
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis | Code | 3
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
Trustworthy AI: Safety, Bias, and Privacy -- A Survey | - | 0
AI Alignment at Your Discretion | - | 0
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | - | 0
Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Code | 1
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | - | 0
STAIR: Improving Safety Alignment with Introspective Reasoning | Code | 2
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | - | 0
The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | - | 0
LLM Safety Alignment is Divergence Estimation in Disguise | Code | 0
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | - | 0
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Code | 1
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Code | 2
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking | Code | 1
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | - | 0
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | - | 0
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Code | 0
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | - | 0
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | - | 0
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models | - | 0
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Code | 0
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Code | 0
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | - | 0
SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation | - | 0
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | - | 0