| Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Oct 14, 2024 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Jun 10, 2024 | Safety Alignment | Code Available | 2 |
| Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning | Feb 21, 2024 | Instruction Following, Language Modeling | Code Available | 2 |
| STAIR: Improving Safety Alignment with Introspective Reasoning | Feb 4, 2025 | Safety Alignment | Code Available | 2 |
| GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Aug 12, 2023 | Ethics, Red Teaming | Code Available | 2 |
| ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Feb 19, 2024 | Safety Alignment | Code Available | 2 |
| DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Feb 25, 2024 | In-Context Learning, Safety Alignment | Code Available | 2 |
| CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion | Mar 12, 2024 | Code Completion, Safety Alignment | Code Available | 2 |
| PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | May 20, 2025 | LLM Jailbreak, Safety Alignment | Code Available | 2 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Jan 29, 2025 | Red Teaming, Safety Alignment | Code Available | 2 |