| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available | 0 | 5 |
| RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Jun 4, 2025 | Red Teaming | Code Available | 0 | 5 |
| RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Jun 4, 2025 | Red Teaming | Code Available | 0 | 5 |
| Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Dec 30, 2023 | Red Teaming | Code Available | 0 | 5 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Jul 8, 2025 | Red Teaming | Code Available | 0 | 5 |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Aug 27, 2024 | Red Teaming, Transfer Learning | Code Available | 0 | 5 |
| Aligners: Decoupling LLMs and Alignment | Mar 7, 2024 | Instruction Following, Red Teaming | Code Available | 0 | 5 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Oct 17, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Dec 10, 2024 | Red Teaming | Code Available | 0 | 5 |