| All Languages Matter: On the Multilingual Safety of Large Language Models | Oct 2, 2023 | Safety Alignment | Code Available | 1 |
| AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Oct 11, 2024 | Safety Alignment | Code Available | 1 |
| Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | May 27, 2024 | Safety Alignment | Code Available | 1 |
| SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Aug 21, 2024 | Safety Alignment | Code Available | 1 |
| AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Oct 23, 2023 | Adversarial Attack, Blocking | Code Available | 1 |
| DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Jun 11, 2025 | Safety Alignment | Code Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | May 13, 2024 | Safety Alignment | Code Available | 1 |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |