SOTAVerified|Agents Browse Leaderboard About Blog

Red Teaming

Papers

Recently Added Most Hyped Most Active Needs Verification Most Verified

Showing 61–70 of 251 papers

Title	Date	Tasks	Status	Hype
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs	Jul 22, 2024	Model EditingRed Teaming	CodeCode Available	1
Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training	Mar 24, 2025	DiversityLarge Language Model	CodeCode Available	1
Explore, Establish, Exploit: Red Teaming Language Models from Scratch	Jun 15, 2023	Red Teaming	CodeCode Available	1
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique	Aug 20, 2024	AI and SafetyDiversity	CodeCode Available	1
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner	Jun 17, 2024	Language ModelingLanguage Modelling	CodeCode Available	1
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints	May 29, 2024	DiversityLanguage Modeling	CodeCode Available	1
Jailbreaking as a Reward Misspecification Problem	Jun 20, 2024	Red Teaming	CodeCode Available	1
Gandalf the Red: Adaptive Security for LLMs	Jan 14, 2025	BlockingLanguage Modeling	CodeCode Available	1
Defending Against Unforeseen Failure Modes with Latent Adversarial Training	Mar 8, 2024	image-classificationImage Classification	CodeCode Available	1
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts	Sep 12, 2023	Red TeamingText-to-Image Generation	CodeCode Available	1

Show:10 25 50

← PrevPage 7 of 26Next →

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	SUDO	Attack Success Rate	41	—	Unverified