SOTAVerified

Safety Alignment

Papers

Showing 26–50 of 288 papers

Title | Status | Hype
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models | Code | 1
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Code | 1
Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1
Lifelong Safety Alignment for Language Models | Code | 1
Locking Down the Finetuned LLMs Safety | Code | 1
Can Editing LLMs Inject Harm? | Code | 1
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Code | 1
All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1
Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
Page 2 of 12

No leaderboard results yet.