| Can Large Language Models Automatically Jailbreak GPT-4V? | Jul 23, 2024 | Face Recognition, In-Context Learning | Unverified | 0 |
| RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Jul 23, 2024 | Red Teaming | Unverified | 0 |
| Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Jul 22, 2024 | Model Editing, Red Teaming | Code Available | 1 |
| Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Jul 22, 2024 | Contrastive Learning, Gender Prediction | Unverified | 0 |
| Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Jul 21, 2024 | Ethics, Red Teaming | Unverified | 0 |
| Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Jul 20, 2024 | Red Teaming | Code Available | 1 |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Jul 18, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Jul 17, 2024 | Red Teaming | Unverified | 0 |
| AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Jul 17, 2024 | Autonomous Driving, Backdoor Attack | Code Available | 3 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Jul 17, 2024 | Benchmarking, Red Teaming | Code Available | 2 |
| ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Jul 12, 2024 | Language Modeling, Language Modelling | Code Available | 0 |
| The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | Jul 10, 2024 | Fairness, Red Teaming | Unverified | 0 |
| Automated Progressive Red Teaming | Jul 4, 2024 | Active Learning, Red Teaming | Code Available | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| Purple-teaming LLMs with Adversarial Defender Training | Jul 1, 2024 | Generative Adversarial Network, Red Teaming | Unverified | 0 |
| WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Jun 26, 2024 | Chatbot, Red Teaming | Code Available | 2 |
| The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Jun 26, 2024 | Cross-Lingual Transfer, Red Teaming | Unverified | 0 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Jun 25, 2024 | Language Modeling, Language Modelling | Code Available | 1 |
| Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | Jun 25, 2024 | Red Teaming, Reinforcement Learning (RL) | Unverified | 0 |
| Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Jun 21, 2024 | Red Teaming, TruthfulQA | Code Available | 0 |
| Adversaries Can Misuse Combinations of Safe Models | Jun 20, 2024 | Red Teaming | Unverified | 0 |
| Jailbreaking as a Reward Misspecification Problem | Jun 20, 2024 | Red Teaming | Code Available | 1 |
| Finding Safety Neurons in Large Language Models | Jun 20, 2024 | Misinformation, Red Teaming | Unverified | 0 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16k, Language Modelling | Code Available | 0 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Jun 17, 2024 | Language Modeling, Language Modelling | Code Available | 1 |