| Large Language Model Unlearning | Oct 14, 2023 | Language Modeling | Code Available | 1 | 5 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Jun 25, 2024 | Language Modeling | Code Available | 1 | 5 |
| Aloe: A Family of Fine-tuned Open Healthcare LLMs | May 3, 2024 | Prompt Engineering, Red Teaming | Code Available | 1 | 5 |
| Control Risk for Potential Misuse of Artificial Intelligence in Science | Dec 11, 2023 | Red Teaming | Code Available | 1 | 5 |
| Jailbreaking as a Reward Misspecification Problem | Jun 20, 2024 | Red Teaming | Code Available | 1 | 5 |
| Jailbroken: How Does LLM Safety Training Fail? | Jul 5, 2023 | Red Teaming | Code Available | 1 | 5 |
| Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Oct 10, 2023 | Red Teaming | Code Available | 1 | 5 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Oct 19, 2023 | In-Context Learning, Red Teaming | Code Available | 1 | 5 |
| AI Control: Improving Safety Despite Intentional Subversion | Dec 12, 2023 | Red Teaming | Code Available | 1 | 5 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Mar 8, 2024 | Image Classification | Code Available | 1 | 5 |