SOTAVerified

Safety Alignment

Papers

Showing 26–50 of 288 papers

Title | Status | Hype
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
Page 2 of 12

No leaderboard results yet.