SOTAVerified

Red Teaming

Papers

Showing 181–190 of 251 papers

Title | Status | Hype
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts | Code | 0
The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing |  | 0
Automated Progressive Red Teaming | Code | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
Purple-teaming LLMs with Adversarial Defender Training |  | 0
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm |  | 0
Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations |  | 0
Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Code | 0
Finding Safety Neurons in Large Language Models |  | 0
Adversaries Can Misuse Combinations of Safe Models |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified