SOTAVerified

Safety Alignment

Papers

Showing 151–175 of 288 papers

Title | Status | Hype
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | - | 0
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | - | 0
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | - | 0
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | - | 0
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | - | 0
ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | - | 0
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | - | 0
STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | - | 0
Effectively Controlling Reasoning Models through Thinking Intervention | - | 0
Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | - | 0
Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | - | 0
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | - | 0
JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | - | 0
Backtracking for Safety | - | 0
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | - | 0
SafeArena: Evaluating the Safety of Autonomous Web Agents | - | 0
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | - | 0
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | - | 0
LLM-Safety Evaluations Lack Robustness | - | 0
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | - | 0
FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | - | 0
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | - | 0
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | - | 0
C3AI: Crafting and Evaluating Constitutions for Constitutional AI | - | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0

No leaderboard results yet.