SOTAVerified

Safety Alignment

Papers

Showing 76-100 of 288 papers

| Title | Status | Hype |
| --- | --- | --- |
| Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | | 0 |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | | 0 |
| VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | | 0 |
| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | | 0 |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | | 0 |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | | 0 |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | | 0 |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Code | 1 |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Code | 2 |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | | 0 |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | | 0 |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | | 0 |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | | 0 |
| Effectively Controlling Reasoning Models through Thinking Intervention | | 0 |
| VPO: Aligning Text-to-Video Generation Models with Prompt Optimization | Code | 1 |
| sudo rm -rf agentic_security | Code | 1 |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1 |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | | 0 |
| SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging | Code | 1 |
| Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | | 0 |
| Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | | 0 |
| Backtracking for Safety | | 0 |
| Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | | 0 |
| SafeArena: Evaluating the Safety of Autonomous Web Agents | | 0 |

No leaderboard results yet.