SOTAVerified

Safety Alignment

Papers

Showing 226-250 of 288 papers

| Title | Status | Hype |
| --- | --- | --- |
| SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model | Code | 1 |
| Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | | 0 |
| Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models | | 0 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Code | 2 |
| How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Code | 2 |
| SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | | 0 |
| On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept | | 0 |
| OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1 |
| Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | | 0 |
| One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0 |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1 |
| Cross-Modal Safety Alignment: Is textual unlearning all you need? | | 0 |
| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1 |
| No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks | | 0 |
| Robustifying Safety-Aligned Large Language Models through Clean Data Curation | | 0 |
| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | | 0 |
| WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | | 0 |
| Safety Alignment for Vision Language Models | | 0 |
| PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Code | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1 |
| Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Code | 1 |
| AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Code | 2 |
| Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Code | 0 |
| CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | | 0 |
| Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | | 0 |
Page 10 of 12

No leaderboard results yet.