SOTAVerified

Safety Alignment

Papers

Showing 101–150 of 288 papers

Title | Status | Hype
Backtracking Improves Generation Safety | | 0
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | | 0
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | | 0
C3AI: Crafting and Evaluating Constitutions for Constitutional AI | | 0
Can Large Language Models Automatically Jailbreak GPT-4V? | | 0
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues | | 0
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | | 0
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | | 0
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | | 0
Cross-Modal Safety Alignment: Is textual unlearning all you need? | | 0
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | | 0
Deceptive Alignment Monitoring | | 0
Mitigating Unsafe Feedback with Learning Constraints | | 0
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | | 0
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | | 0
Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning | | 0
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | | 0
Enhancing Jailbreak Attacks with Diversity Guidance | | 0
Effectively Controlling Reasoning Models through Thinking Intervention | | 0
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models | | 0
Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens | | 0
Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | | 0
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | | 0
EnJa: Ensemble Jailbreak on Large Language Models | | 0
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | | 0
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models | | 0
ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization | | 0
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions | | 0
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | | 0
FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | | 0
FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | | 0
Finding Safety Neurons in Large Language Models | | 0
From Evaluation to Defense: Advancing Safety in Video Large Language Models | | 0
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring | | 0
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior | | 0
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | | 0
Jailbreak Attacks and Defenses Against Large Language Models: A Survey | | 0
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models | | 0
JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | | 0
JULI: Jailbreak Large Language Models by Self-Introspection | | 0
Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning | | 0
Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | | 0
Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh | | 0
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | | 0
LLM-Safety Evaluations Lack Robustness | | 0
LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | | 0
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | | 0
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models | | 0
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | | 0
Page 3 of 6

No leaderboard results yet.