SOTAVerified|Agents Browse Leaderboard About

Safety Alignment

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 11–20 of 288 papers

Title	Date	Tasks	Status	Hype
Safety Alignment Should Be Made More Than Just a Few Tokens Deep	Jun 10, 2024	Safety Alignment	CodeCode Available	2
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States	Jun 9, 2024	Safety Alignment	CodeCode Available	2
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs	Apr 11, 2024	Safety Alignment	CodeCode Available	2
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion	Mar 12, 2024	Code CompletionSafety Alignment	CodeCode Available	2
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers	Feb 25, 2024	In-Context LearningSafety Alignment	CodeCode Available	2
Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning	Feb 21, 2024	Instruction FollowingLanguage Modeling	CodeCode Available	2
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs	Feb 19, 2024	Safety Alignment	CodeCode Available	2
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models	Feb 3, 2024	Instruction FollowingSafety Alignment	CodeCode Available	2
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!	Oct 5, 2023	Red TeamingSafety Alignment	CodeCode Available	2
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher	Aug 12, 2023	EthicsRed Teaming	CodeCode Available	2

Show:10 25 50

← PrevPage 2 of 29Next →

No leaderboard results yet.