
Safety Alignment

Papers

Showing 151–200 of 288 papers

Title | Status | Hype
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | | 0
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | | 0
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | | 0
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | | 0
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | | 0
ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | | 0
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | | 0
STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | | 0
Effectively Controlling Reasoning Models through Thinking Intervention | | 0
Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | | 0
Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | | 0
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | | 0
JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | | 0
Backtracking for Safety | | 0
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | | 0
SafeArena: Evaluating the Safety of Autonomous Web Agents | | 0
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | | 0
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | | 0
LLM-Safety Evaluations Lack Robustness | | 0
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | | 0
FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | | 0
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | | 0
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | | 0
C3AI: Crafting and Evaluating Constitutions for Constitutional AI | | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | | 0
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Code | 0
Understanding and Rectifying Safety Perception Distortion in VLMs | | 0
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | | 0
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | | 0
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Code | 0
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Code | 0
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | | 0
Trustworthy AI: Safety, Bias, and Privacy -- A Survey | | 0
AI Alignment at Your Discretion | | 0
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | | 0
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | | 0
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | | 0
The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | | 0
LLM Safety Alignment is Divergence Estimation in Disguise | Code | 0
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | | 0
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | | 0
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | | 0
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Code | 0
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | | 0
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | | 0
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models | | 0
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Code | 0
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Code | 0
Page 4 of 6

Leaderboard

No leaderboard results yet.