SOTAVerified

Safety Alignment

Papers

Showing 51–75 of 288 papers

Title | Status | Hype
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Code | 1
Probing the Robustness of Large Language Models Safety to Latent Perturbations | Code | 1
LookAhead Tuning: Safer Language Models via Partial Answer Previews | Code | 1
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | Code | 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization | Code | 1
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Code | 1
Autonomous Microscopy Experiments through Large Language Model Agents | Code | 1
Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Code | 1
Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Code | 1
Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation | Code | 1
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Code | 1
MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance | Code | 1
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Code | 1
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Code | 1
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
Don't Say No: Jailbreaking LLM by Suppressing Refusal | Code | 1
Bayesian scaling laws for in-context learning | Code | 1
OR-Bench: An Over-Refusal Benchmark for Large Language Models | Code | 1
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Code | 1
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | Code | 1
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! | Code | 1
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Code | 1
Page 3 of 12

Leaderboard

No leaderboard results yet.