SOTAVerified

Safety Alignment

Papers

Showing 101–150 of 288 papers

Title | Status | Hype
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Code | 0
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Code | 0
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
SafeWorld: Geo-Diverse Safety Alignment | Code | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Code | 0
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Code | 0
Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Code | 0
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Code | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Code | 0
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Code | 0
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | Code | 0
Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Code | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Code | 0
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0
LLM Safety Alignment is Divergence Estimation in Disguise | Code | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Code | 0
Can a large language model be a gaslighter? | Code | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Code | 0
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Code | 0
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | - | 0
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | - | 0
Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | - | 0
Trustworthy AI: Safety, Bias, and Privacy -- A Survey | - | 0
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | - | 0
TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | - | 0
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | - | 0
Understanding and Rectifying Safety Perception Distortion in VLMs | - | 0
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | - | 0
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | - | 0
Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | - | 0
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | - | 0
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap | - | 0
Vulnerability-Aware Alignment: Mitigating Uneven Forgetting in Harmful Fine-Tuning | - | 0
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing | - | 0
What's Pulling the Strings? Evaluating Integrity and Attribution in AI Training and Inference through Concept Shift | - | 0