SOTAVerified
Tasks › Safety Alignment
Papers
Showing 281–288 of 288 papers

| Title | Date | Tasks | Status | Hype | Score |
|---|---|---|---|---|---|
| The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models | Feb 3, 2025 | Safety Alignment | Unverified | 0 | 0 |
| The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence | Feb 24, 2025 | Safety Alignment | Unverified | 0 | 0 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | Unverified | 0 | 0 |
| Thought Manipulation: External Thought Can Be Efficient for Large Reasoning Models | Apr 18, 2025 | Safety Alignment | Unverified | 0 | 0 |
| Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching | May 22, 2024 | Safety Alignment | Unverified | 0 | 0 |
| Towards Inference-time Category-wise Safety Steering for Large Language Models | Oct 2, 2024 | Safety Alignment | Unverified | 0 | 0 |
| Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization | Apr 19, 2025 | Contrastive Learning, Image Generation | Unverified | 0 | 0 |
| Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | Jan 27, 2025 | Language Modeling | Unverified | 0 | 0 |
No leaderboard results yet for this task.