SOTAVerified|Agents Browse Leaderboard About

Safety Alignment

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 21–30 of 288 papers

Title	Date	Tasks	Status	Hype
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique	Aug 20, 2024	AI and SafetyDiversity	CodeCode Available	1
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender	Apr 13, 2025	Safety Alignment	CodeCode Available	1
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts	Nov 9, 2023	Optical Character Recognition (OCR)Safety Alignment	CodeCode Available	1
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation	Oct 11, 2024	Safety Alignment	CodeCode Available	1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!	Feb 19, 2024	Language ModelingLanguage Modelling	CodeCode Available	1
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset	Jul 10, 2023	Question AnsweringSafety Alignment	CodeCode Available	1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models	Oct 23, 2023	Adversarial AttackBlocking	CodeCode Available	1
Autonomous Microscopy Experiments through Large Language Model Agents	Dec 18, 2024	BenchmarkingExperimental Design	CodeCode Available	1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment	Nov 15, 2023	Red TeamingSafety Alignment	CodeCode Available	1
Don't Say No: Jailbreaking LLM by Suppressing Refusal	Apr 25, 2024	Natural Language InferenceSafety Alignment	CodeCode Available	1

Show:10 25 50

← PrevPage 3 of 29Next →

No leaderboard results yet.