SOTAVerified

Safety Alignment

Papers

Showing 151–200 of 288 papers

Title | Hype
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | 0
Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | 0
Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | 0
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | 0
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | 0
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | 0
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | 0
AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) | 0
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization | 0
AI Alignment at Your Discretion | 0
AI Awareness | 0
aiXamine: Simplified LLM Safety and Security | 0
Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | 0
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | 0
Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey | 0
Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | 0
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | 0
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | 0
Backtracking for Safety | 0
Backtracking Improves Generation Safety | 0
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | 0
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | 0
C3AI: Crafting and Evaluating Constitutions for Constitutional AI | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | 0
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | 0
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | 0
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | 0
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | 0
Cross-Modal Safety Alignment: Is textual unlearning all you need? | 0
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | 0
Deceptive Alignment Monitoring | 0
Mitigating Unsafe Feedback with Learning Constraints | 0
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | 0
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | 0
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | 0
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | 0
Enhancing Jailbreak Attacks with Diversity Guidance | 0
Effectively Controlling Reasoning Models through Thinking Intervention | 0
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | 0
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | 0
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | 0
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | 0
EnJa: Ensemble Jailbreak on Large Language Models | 0
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | 0
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | 0
ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | 0
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | 0
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | 0
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | 0
FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | 0
Page 4 of 6

Leaderboard

No leaderboard results yet.