| Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | May 20, 2025 | Red Teaming | Code Available | 0 | 5 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Nov 15, 2023 | Red Teaming | Code Available | 0 | 5 |
| Distract Large Language Models for Automatic Jailbreak Attack | Mar 13, 2024 | Red Teaming | Code Available | 0 | 5 |
| Capability-Based Scaling Laws for LLM Red-Teaming | May 26, 2025 | MMLU, Prompt Engineering | Code Available | 0 | 5 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Oct 25, 2024 | Benchmarking, Change Detection | Code Available | 0 | 5 |
| RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Jan 29, 2025 | Chatbot, Red Teaming | Code Available | 0 | 5 |
| Red-Teaming Segment Anything Model | Apr 2, 2024 | Image Segmentation, model | Code Available | 0 | 5 |
| SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Oct 21, 2024 | LLM Jailbreak, Red Teaming | Code Available | 0 | 5 |
| SAGE: A Generic Framework for LLM Safety Evaluation | Apr 28, 2025 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Apr 4, 2024 | Red Teaming | Code Available | 0 | 5 |
| Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Apr 23, 2024 | Decision Making, Question Answering | Code Available | 0 | 5 |
| Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Dec 30, 2023 | Red Teaming | Code Available | 0 | 5 |
| Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | May 11, 2025 | Outlier Detection, Red Teaming | Code Available | 0 | 5 |
| RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Jun 4, 2025 | Red Teaming | Code Available | 0 | 5 |
| RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Jun 4, 2025 | Red Teaming | Code Available | 0 | 5 |
| Red Teaming Language Models for Processing Contradictory Dialogues | May 16, 2024 | Red Teaming, valid | Code Available | 0 | 5 |
| BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Jun 3, 2025 | Prompt Engineering, Red Teaming | Code Available | 0 | 5 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Aug 27, 2024 | Red Teaming, Transfer Learning | Code Available | 0 | 5 |
| Overriding Safety protections of Open-source Models | Sep 28, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| Aligners: Decoupling LLMs and Alignment | Mar 7, 2024 | Instruction Following, Red Teaming | Code Available | 0 | 5 |
| BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Oct 17, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Jul 8, 2025 | Red Teaming | Code Available | 0 | 5 |
| Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Dec 10, 2024 | Red Teaming | Code Available | 0 | 5 |
| Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Oct 31, 2024 | Red Teaming, Safety Alignment | Code Available | 0 | 5 |