
Safety Alignment

Papers

Showing 201-250 of 288 papers

Title | Status | Hype
EnJa: Ensemble Jailbreak on Large Language Models | - | 0
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Can Large Language Models Automatically Jailbreak GPT-4V? | - | 0
Course-Correction: Safety Alignment Using Synthetic Preferences | Code | 1
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | - | 0
The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Code | 0
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture | - | 0
Jailbreak Attacks and Defenses Against Large Language Models: A Survey | - | 0
Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Code | 1
From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Code | 1
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | - | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Code | 0
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | - | 0
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | - | 0
Cross-Modality Safety Alignment | Code | 2
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference | - | 0
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch | - | 0
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Code | 1
Finding Safety Neurons in Large Language Models | - | 0
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Code | 1
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations | Code | 1
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Code | 1
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | - | 0
Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | - | 0
Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | - | 0
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | - | 0
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | - | 0
One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0
Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1
Cross-Modal Safety Alignment: Is textual unlearning all you need? | - | 0
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | - | 0
Robustifying Safety-Aligned Large Language Models through Clean Data Curation | - | 0
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | - | 0
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | - | 0
Safety Alignment for Vision Language Models | - | 0
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Code | 1
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Code | 0
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | - | 0
Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | - | 0
Page 5 of 6

Leaderboard

No leaderboard results yet.