SOTAVerified

Red Teaming

Papers

Showing 201–225 of 251 papers

Title | Status | Hype
Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | | 0
SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | | 0
Safety Alignment for Vision Language Models | | 0
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | | 0
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | | 0
A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | | 0
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | | 0
Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses | | 0
Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model | | 0
AI red-teaming is a sociotechnical challenge: on values, labor, and harms | | 0
A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | | 0
AdvAgent: Controllable Blackbox Red-teaming on Web Agents | | 0
Understanding and Mitigating Risks of Generative AI in Financial Services | | 0
Adversaries Can Misuse Combinations of Safe Models | | 0
STACK: Adversarial Attacks on LLM Safeguard Pipelines | | 0
STAR: SocioTechnical Approach to Red Teaming Language Models | | 0
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | | 0
SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | | 0
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | | 0
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | | 0
AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | | 0
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming | | 0
A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | | 0
GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | | 0
Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified