SOTAVerified

Safety Alignment

Papers

Showing 76-100 of 288 papers

Title | Status | Hype
Lifelong Safety Alignment for Language Models | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
Bayesian scaling laws for in-context learning | Code | 1
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Code | 1
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Code | 1
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Code | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Code | 0
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Code | 0
One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Code | 0
AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models | Code | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Code | 0
Can a large language model be a gaslighter? | Code | 0

No leaderboard results yet.