| Title | Date | Topics | Code |
| --- | --- | --- | --- |
| ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Jun 17, 2024 | Instruction Following, Safety Alignment | Code Available |
| AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Apr 13, 2025 | Safety Alignment | Code Available |
| Course-Correction: Safety Alignment Using Synthetic Preferences | Jul 23, 2024 | Safety Alignment | Code Available |
| AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Oct 11, 2024 | Safety Alignment | Code Available |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available |
| Can Editing LLMs Inject Harm? | Jul 29, 2024 | Fairness, General Knowledge | Code Available |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Code Available |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available |
| Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Jun 9, 2025 | Multi-agent Reinforcement Learning, Safety Alignment | Code Available |