| garak: A Framework for Security Probing Large Language Models | Jun 16, 2024 | Red Teaming | CodeCode Available | 9 |
| PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System | Oct 1, 2024 | Red Teaming | CodeCode Available | 7 |
| Seamless: Multilingual Expressive and Streaming Speech Translation | Dec 8, 2023 | automatic-speech-translationMachine Translation | CodeCode Available | 6 |
| HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Feb 6, 2024 | Red Teaming | CodeCode Available | 4 |
| AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | Oct 3, 2024 | Red Teaming | CodeCode Available | 3 |
| AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Jul 17, 2024 | Autonomous DrivingBackdoor Attack | CodeCode Available | 3 |
| Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Aug 23, 2022 | Language ModellingRed Teaming | CodeCode Available | 3 |
| Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Jan 29, 2025 | Red TeamingSafety Alignment | CodeCode Available | 2 |
| LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Aug 27, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 2 |
| Tamper-Resistant Safeguards for Open-Weight LLMs | Aug 1, 2024 | Red TeamingTAR | CodeCode Available | 2 |