| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| Aloe: A Family of Fine-tuned Open Healthcare LLMs | May 3, 2024 | Prompt Engineering, Red Teaming | Code Available |
| Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Apr 26, 2024 | Language Modelling, Prompt Engineering | Code Available |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Mar 8, 2024 | Image Classification | Code Available |
| Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Feb 14, 2024 | Image Generation, Red Teaming | Code Available |
| Causality Analysis for Evaluating the Security of Large Language Models | Dec 13, 2023 | Red Teaming | Code Available |
| AI Control: Improving Safety Despite Intentional Subversion | Dec 12, 2023 | Red Teaming | Code Available |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Dec 11, 2023 | Red Teaming | Code Available |
| Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Nov 15, 2023 | Red Teaming, Safety Alignment | Code Available |
| Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Oct 22, 2023 | Language Modelling | Code Available |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Oct 19, 2023 | In-Context Learning, Red Teaming | Code Available |