SOTAVerified

Safety Alignment

Papers

Showing 151–200 of 288 papers

Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
Model Card and Evaluations for Claude Models
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
Noise Injection Systemically Degrades Large Language Model Safety Guardrails
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks
Off-Policy Risk Assessment in Markov Decision Processes
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Playing Language Game with LLMs Leads to Jailbreaking
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
SafeVid: Toward Safety Aligned Video Large Multimodal Models
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks
Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks
SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression
Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Shape it Up! Restoring LLM Safety during Finetuning
Smaller Large Language Models Can Do Moral Self-Correction
SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
SPIN: Self-Supervised Prompt INjection
STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
sudoLLM: On Multi-role Alignment of Language Models
Superficial Safety Alignment Hypothesis
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching
Towards Inference-time Category-wise Safety Steering for Large Language Models
Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Page 4 of 6

No leaderboard results yet.