SOTAVerified

Safety Alignment

Papers

Showing 151–200 of 288 papers

Title | Status | Hype
Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | - | 0
SafeWorld: Geo-Diverse Safety Alignment | Code | 0
PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Code | 1
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | - | 0
PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | - | 0
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Code | 1
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | - | 0
Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Code | 0
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine | - | 0
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | - | 0
Playing Language Game with LLMs Leads to Jailbreaking | - | 0
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models | - | 0
Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Code | 0
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs | - | 0
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Code | 0
Smaller Large Language Models Can Do Moral Self-Correction | - | 0
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Code | 1
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization | Code | 0
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Code | 0
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | - | 0
Bayesian scaling laws for in-context learning | Code | 1
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Code | 0
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Code | 0
SPIN: Self-Supervised Prompt INjection | - | 0
Locking Down the Finetuned LLMs Safety | Code | 1
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | Code | 2
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | Code | 1
Can a large language model be a gaslighter? | Code | 0
AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Code | 1
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | - | 0
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | - | 0
Superficial Safety Alignment Hypothesis | - | 0
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Code | 0
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | - | 0
LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | - | 0
SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | - | 0
Towards Inference-time Category-wise Safety Steering for Large Language Models | - | 0
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey | Code | 3
Backtracking Improves Generation Safety | - | 0
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | - | 0
Mitigating Unsafe Feedback with Learning Constraints | - | 0
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering | Code | 1
Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer | Code | 0
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | - | 0
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning | Code | 1
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | - | 0
Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Code | 0
Page 4 of 6
