SOTAVerified

Safety Alignment

Papers

Showing 226–250 of 288 papers

Title | Status | Hype
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | | 0
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | | 0
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | | 0
Noise Injection Systemically Degrades Large Language Model Safety Guardrails | | 0
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | | 0
Off-Policy Risk Assessment in Markov Decision Processes | | 0
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | | 0
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | | 0
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | | 0
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | | 0
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | | 0
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | | 0
Playing Language Game with LLMs Leads to Jailbreaking | | 0
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | | 0
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | | 0
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models | | 0
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | | 0
Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions | | 0
Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation | | 0
Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model | | 0
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models | | 0
Robustifying Safety-Aligned Large Language Models through Clean Data Curation | | 0
SafeArena: Evaluating the Safety of Autonomous Web Agents | | 0
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | | 0
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety | | 0
Page 10 of 12

No leaderboard results yet.