SOTAVerified

Safety Alignment

Papers

Showing 51–75 of 288 papers

| Title | Status | Hype |
|---|---|---|
| All Languages Matter: On the Multilingual Safety of Large Language Models | Code | 1 |
| MPO: Multilingual Safety Alignment via Reward Gap Optimization | Code | 1 |
| SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset | Code | 1 |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Code | 1 |
| LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1 |
| Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1 |
| FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1 |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1 |
| MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Code | 1 |
| SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Code | 1 |
| Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Code | 1 |
| Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Code | 1 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1 |
| BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1 |
| Improving LLM Safety Alignment with Dual-Objective Optimization | Code | 1 |
| Lifelong Safety Alignment for Language Models | Code | 1 |
| Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1 |
| Bayesian scaling laws for in-context learning | Code | 1 |
| Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates | Code | 1 |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1 |
| Locking Down the Finetuned LLMs Safety | Code | 1 |
| Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Code | 1 |
| Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1 |
| Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models | Code | 1 |
Page 3 of 12
