SOTAVerified

Red Teaming

Papers

Showing 201–250 of 251 papers

| Title | Status | Hype |
|---|---|---|
| Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | | 0 |
| SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | | 0 |
| Safety Alignment for Vision Language Models | | 0 |
| Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | | 0 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | | 0 |
| A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | | 0 |
| A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0 |
| Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses | | 0 |
| Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model | | 0 |
| AI red-teaming is a sociotechnical challenge: on values, labor, and harms | | 0 |
| A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | | 0 |
| AdvAgent: Controllable Blackbox Red-teaming on Web Agents | | 0 |
| Understanding and Mitigating Risks of Generative AI in Financial Services | | 0 |
| Adversaries Can Misuse Combinations of Safe Models | | 0 |
| STACK: Adversarial Attacks on LLM Safeguard Pipelines | | 0 |
| STAR: SocioTechnical Approach to Red Teaming Language Models | | 0 |
| AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | | 0 |
| SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | | 0 |
| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 0 |
| Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | | 0 |
| AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | | 0 |
| Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming | | 0 |
| A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | | 0 |
| GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | | 0 |
| EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection | | 0 |
| Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity | | 0 |
| Exploring Straightforward Conversational Red-Teaming | | 0 |
| Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | | 0 |
| Fast Proxies for LLM Robustness Evaluation | | 0 |
| Embodied Red Teaming for Auditing Robotic Foundation Models | | 0 |
| Finding Safety Neurons in Large Language Models | | 0 |
| ELAB: Extensive LLM Alignment Benchmark in Persian Language | | 0 |
| FLIRT: Feedback Loop In-context Red Teaming | | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | | 0 |
| Effective Red-Teaming of Policy-Adherent Agents | | 0 |
| DMRL: Data- and Model-aware Reward Learning for Data Extraction | | 0 |
| Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | | 0 |
| Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | | 0 |
| Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | | 0 |
| Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0 |
| DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | | 0 |
| Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | | 0 |
| Atoxia: Red-teaming Large Language Models with Target Toxic Answers | | 0 |
| HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | | 0 |
| Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | | 0 |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | | 0 |
| Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |