SOTAVerified|Agents Browse Leaderboard About

Safety Alignment

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 271–280 of 288 papers

Title	Date	Tasks	Status	Hype
Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game	Apr 3, 2024	Prompt EngineeringSafety Alignment	—Unverified	0
Enhancing Jailbreak Attacks with Diversity Guidance	Mar 1, 2024	DiversityLanguage Modelling	—Unverified	0
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper	Feb 24, 2024	Adversarial AttackSafety Alignment	—Unverified	0
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement	Feb 23, 2024	Safety Alignment	—Unverified	0
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications	Feb 7, 2024	Safety Alignment	—Unverified	0
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack	Dec 12, 2023	Question AnsweringSafety Alignment	CodeCode Available	0
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking	Nov 16, 2023	Safety Alignment	—Unverified	0
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models	Nov 16, 2023	Backdoor AttackData Poisoning	—Unverified	0
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities	Nov 15, 2023	EthicsFairness	CodeCode Available	0
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming	Nov 13, 2023	Instruction FollowingRed Teaming	—Unverified	0

Show:10 25 50

← PrevPage 28 of 29Next →

No leaderboard results yet.