| Title | Date | Tasks | Code | Stars |
|---|---|---|---|---|
| A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement | Oct 17, 2024 | Language Modelling | Code Available | 0 |
| SPIN: Self-Supervised Prompt INjection | Oct 17, 2024 | Safety Alignment | Unverified | 0 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Oct 17, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Can a large language model be a gaslighter? | Oct 11, 2024 | Language Modelling | Code Available | 0 |
| Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements | Oct 11, 2024 | Safety Alignment | Unverified | 0 |
| Superficial Safety Alignment Hypothesis | Oct 7, 2024 | Attribute, Binary Classification | Unverified | 0 |
| Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models | Oct 7, 2024 | Language Modelling | Code Available | 0 |
| Toxic Subword Pruning for Dialogue Response Generation on Large Language Models | Oct 5, 2024 | Language Modelling, Machine Translation | Unverified | 0 |
| LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Oct 3, 2024 | Adversarial Robustness, Safety Alignment | Unverified | 0 |