SOTAVerified

Safety Alignment

Papers

Showing 151–175 of 288 papers

Title | Status | Hype
Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | | 0
Model Card and Evaluations for Claude Models | | 0
Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | | 0
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | | 0
More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | | 0
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | | 0
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | | 0
NeuRel-Attack: Neuron Relearning for Safety Disalignment in Large Language Models | | 0
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | | 0
Noise Injection Systemically Degrades Large Language Model Safety Guardrails | | 0
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | | 0
Off-Policy Risk Assessment in Markov Decision Processes | | 0
One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | | 0
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | | 0
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | | 0
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | | 0
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | | 0
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | | 0
Playing Language Game with LLMs Leads to Jailbreaking | | 0
PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing | | 0
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | | 0
SafeVid: Toward Safety Aligned Video Large Multimodal Models | | 0
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | | 0
SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | | 0
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | | 0
Page 7 of 12

No leaderboard results yet.