SOTAVerified

Red Teaming

Papers

Showing 61-70 of 251 papers

Title | Status | Hype
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Code | 1
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Code | 1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Code | 1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Code | 1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Code | 1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Code | 1
Jailbreaking as a Reward Misspecification Problem | Code | 1
Gandalf the Red: Adaptive Security for LLMs | Code | 1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Code | 1
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Code | 1

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | | Unverified