| Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Jan 19, 2024 | Model Editing, Red Teaming | Code Available | 0 | 5 |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Nov 15, 2023 | Red Teaming | Code Available | 0 | 5 |
| Red Teaming Language Models for Processing Contradictory Dialogues | May 16, 2024 | Red Teaming | Code Available | 0 | 5 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Oct 31, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Capability-Based Scaling Laws for LLM Red-Teaming | May 26, 2025 | MMLU, Prompt Engineering | Code Available | 0 | 5 |
| Automated Progressive Red Teaming | Jul 4, 2024 | Active Learning, Red Teaming | Code Available | 0 | 5 |
| Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Dec 10, 2024 | Red Teaming | Code Available | 0 | 5 |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Jul 12, 2024 | Language Modeling | Code Available | 0 | 5 |
| No Offense Taken: Eliciting Offensiveness from Language Models | Oct 2, 2023 | Diversity, Red Teaming | Code Available | 0 | 5 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Jul 8, 2025 | Red Teaming | Code Available | 0 | 5 |
| Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | Mar 6, 2025 | Benchmarking, Language Modeling | Unverified | 0 | 0 |
| CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | May 29, 2025 | Red Teaming | Unverified | 0 | 0 |
| JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Mar 12, 2025 | Red Teaming, Safety Alignment | Unverified | 0 | 0 |
| Conversational Complexity for Assessing Risk in Large Language Models | Sep 2, 2024 | Red Teaming | Unverified | 0 | 0 |
| A Safe Harbor for AI Evaluation and Red Teaming | Mar 7, 2024 | Red Teaming | Unverified | 0 | 0 |
| Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Jan 9, 2025 | Red Teaming | Unverified | 0 | 0 |
| Jailbreaking Large Language Models with Symbolic Mathematics | Sep 17, 2024 | Red Teaming | Unverified | 0 | 0 |
| Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | May 30, 2024 | Red Teaming | Unverified | 0 | 0 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Nov 15, 2023 | Adversarial Attack, Red Teaming | Unverified | 0 | 0 |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Jan 31, 2025 | Red Teaming | Unverified | 0 | 0 |
| KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | Feb 5, 2025 | Diversity, Prompt Engineering | Unverified | 0 | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | Nov 16, 2023 | Red Teaming | Unverified | 0 | 0 |
| Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | Feb 27, 2018 | Red Teaming | Unverified | 0 | 0 |
| IterAlign: Iterative Constitutional Alignment of Large Language Models | Mar 27, 2024 | Red Teaming | Unverified | 0 | 0 |