SOTAVerified

Safety Alignment

Papers

Showing 151–200 of 288 papers

Title | Status | Hype
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | - | 0
SafeWorld: Geo-Diverse Safety Alignment | Code | 0
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | - | 0
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | - | 0
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | - | 0
Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Code | 0
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | - | 0
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | - | 0
Playing Language Game with LLMs Leads to Jailbreaking | - | 0
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | - | 0
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Code | 0
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | - | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Smaller Large Language Models Can Do Moral Self-Correction | - | 0
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Code | 1
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Code | 0
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Code | 0
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | - | 0
Bayesian scaling laws for in-context learning | Code | 1
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Code | 0
SPIN: Self-Supervised Prompt INjection | - | 0
Locking Down the Finetuned LLMs Safety | Code | 1
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Code | 1
Can a large language model be a gaslighter? | Code | 0
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | - | 0
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | - | 0
Superficial Safety Alignment Hypothesis | - | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | - | 0
LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | - | 0
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | - | 0
Towards Inference-time Category-wise Safety Steering for Large Language Models | - | 0
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3
Backtracking Improves Generation Safety | - | 0
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | - | 0
Mitigating Unsafe Feedback with Learning Constraints | - | 0
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Code | 0
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | - | 0
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | - | 0
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Code | 0
Page 4 of 6
