SOTAVerified

Safety Alignment

Papers

Showing 51–100 of 288 papers

Title | Status | Hype
Probing the Robustness of Large Language Models Safety to Latent Perturbations | Code | 1
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Code | 1
Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Code | 1
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Bayesian scaling laws for in-context learning | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Code | 1
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Code | 1
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models | Code | 1
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1
SuperHF: Supervised Iterative Learning from Human Feedback | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | - | 0
Backtracking for Safety | - | 0
Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | - | 0
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | - | 0
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models | - | 0
Mitigating Unsafe Feedback with Learning Constraints | - | 0
Deceptive Alignment Monitoring | - | 0
aiXamine: Simplified LLM Safety and Security | - | 0
LLM-Safety Evaluations Lack Robustness | - | 0
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning | - | 0
AI Awareness | - | 0
AI Alignment at Your Discretion | - | 0
Cross-Modal Safety Alignment: Is textual unlearning all you need? | - | 0
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | - | 0
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | - | 0
Page 2 of 6

No leaderboard results yet.