SOTAVerified

Safety Alignment

Papers

Showing 101–125 of 288 papers

Title | Status | Hype
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Code | 0
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Code | 0
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Code | 0
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Code | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Code | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Code | 0
Can a large language model be a gaslighter? | Code | 0
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Code | 0
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Code | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | Code | 0
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Code | 0
One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Code | 0
Page 5 of 12

Leaderboard

No leaderboard results yet.