SOTAVerified

Safety Alignment

Papers

Showing 251–288 of 288 papers

Title | Status | Hype
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Code | 0
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Code | 0
SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Code | 0
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment | Code | 0
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection | Code | 0
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors | Code | 0
Mitigating Safety Fallback in Editing-based Backdoor Injection on LLMs | Code | 0
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models | Code | 0
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs | Code | 0
Beyond Safe Answers: A Benchmark for Evaluating True Risk Awareness in Large Reasoning Models | Code | 0
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | Code | 0
Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Code | 0
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Code | 0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack | Code | 0
LLM Safety Alignment is Divergence Estimation in Disguise | Code | 0
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Code | 0
Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Code | 0
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models | Code | 0
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | Code | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Code | 0
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities | Code | 0
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning | Code | 0
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | Code | 0
SafeWorld: Geo-Diverse Safety Alignment | Code | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Code | 0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Code | 0
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Code | 0
SAGE: A Generic Framework for LLM Safety Evaluation | Code | 0
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings | Code | 0
The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Code | 0
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Code | 0
Can a large language model be a gaslighter? | Code | 0
