SOTAVerified

Safety Alignment

Papers

Showing 51–100 of 288 papers

| Title | Status | Hype |
|---|---|---|
| DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | — | 0 |
| SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | — | 0 |
| Safety Subspaces are Not Distinct: A Fine-Tuning Case Study | Code | 1 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1 |
| sudoLLM : On Multi-role Alignment of Language Models | — | 0 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | Code | 2 |
| Safety Alignment Can Be Not Superficial With Explicit Safety Signals | — | 0 |
| JULI: Jailbreak Large Language Models by Self-Introspection | — | 0 |
| SafeVid: Toward Safety Aligned Video Large Multimodal Models | — | 0 |
| Noise Injection Systemically Degrades Large Language Model Safety Guardrails | — | 0 |
| Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction | Code | 2 |
| CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | — | 0 |
| Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | — | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | — | 0 |
| One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | — | 0 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0 |
| Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Code | 0 |
| NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | — | 0 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0 |
| What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | — | 0 |
| DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Code | 0 |
| AI Awareness | — | 0 |
| aiXamine: Simplified LLM Safety and Security | — | 0 |
| Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | — | 0 |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | — | 0 |
| VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | — | 0 |
| X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | — | 0 |
| RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | — | 0 |
| Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | — | 0 |
| LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | — | 0 |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Code | 1 |
| LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation | Code | 2 |
| SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | — | 0 |
| More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | — | 0 |
| ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | — | 0 |
| STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | — | 0 |
| Effectively Controlling Reasoning Models through Thinking Intervention | — | 0 |
| VPO: Aligning Text-to-Video Generation Models with Prompt Optimization | Code | 1 |
| sudo rm -rf agentic_security | Code | 1 |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1 |
| Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | — | 0 |
| SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging | Code | 1 |
| Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | — | 0 |
| Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | — | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | — | 0 |
| Backtracking for Safety | — | 0 |
| Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | — | 0 |
| SafeArena: Evaluating the Safety of Autonomous Web Agents | — | 0 |
Page 2 of 6