| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Jul 12, 2024 | Language Modeling | Code Available |
| The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | Jul 10, 2024 | Fairness, Red Teaming | Unverified |
| Automated Progressive Red Teaming | Jul 4, 2024 | Active Learning, Red Teaming | Code Available |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available |
| Purple-teaming LLMs with Adversarial Defender Training | Jul 1, 2024 | Generative Adversarial Network, Red Teaming | Unverified |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | Unverified |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | Jun 25, 2024 | Red Teaming, Reinforcement Learning (RL) | Unverified |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Jun 21, 2024 | Red Teaming, TruthfulQA | Code Available |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | Unverified |
| Adversaries Can Misuse Combinations of Safe Models | Jun 20, 2024 | Red Teaming | Unverified |