| Title | Date | Topics | Code | Citations |
| --- | --- | --- | --- | --- |
| DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Dec 7, 2023 | Code Generation, Red Teaming | Unverified | 0 |
| InfoPattern: Unveiling Information Propagation Patterns in Social Media | Nov 27, 2023 | Red Teaming, Stance Detection | Code Available | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | Nov 16, 2023 | Red Teaming | Unverified | 0 |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Nov 16, 2023 | Backdoor Attack, Data Poisoning | Unverified | 0 |
| Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework | Nov 15, 2023 | Red Teaming | Unverified | 0 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Nov 15, 2023 | Adversarial Attack, Red Teaming | Unverified | 0 |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Nov 15, 2023 | Red Teaming | Code Available | 0 |
| AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Nov 14, 2023 | Diversity, Red Teaming | Unverified | 0 |
| MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Nov 13, 2023 | Instruction Following, Red Teaming | Unverified | 0 |
| Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming | Nov 10, 2023 | Red Teaming | Unverified | 0 |