| Paper | Date | Tasks | Code | Count |
|---|---|---|---|---|
| Gandalf the Red: Adaptive Security for LLMs | Jan 14, 2025 | Blocking, Language Modeling | Code Available | 1 |
| GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Nov 21, 2024 | Bayesian Optimization, Red Teaming | Code Available | 1 |
| Aloe: A Family of Fine-tuned Open Healthcare LLMs | May 3, 2024 | Prompt Engineering, Red Teaming | Code Available | 1 |
| Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Jun 15, 2023 | Red Teaming | Code Available | 1 |
| Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Aug 20, 2024 | AI and Safety, Diversity | Code Available | 1 |
| Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Mar 8, 2024 | Image Classification | Code Available | 1 |
| CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Jun 25, 2024 | Language Modeling | Code Available | 1 |
| Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Jun 17, 2024 | Language Modeling | Code Available | 1 |
| Attack Prompt Generation for Red Teaming and Defending Large Language Models | Oct 19, 2023 | In-Context Learning, Red Teaming | Code Available | 1 |
| AI Control: Improving Safety Despite Intentional Subversion | Dec 12, 2023 | Red Teaming | Code Available | 1 |