Red Teaming

Papers

Showing 241–250 of 251 papers

Title	Date	Tasks	Status	Hype
An Auditing Test To Detect Behavioral Shift in Language Models	Oct 25, 2024	Benchmarking, Change Detection	Code Available	0
ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Low-Perplexity Toxic Prompts	Jul 12, 2024	Language Modeling, Language Modelling	Code Available	0
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models	Oct 14, 2023	Red Teaming	Code Available	0
The Structural Safety Generalization Problem	Apr 13, 2025	Red Teaming	Code Available	0
BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models	Oct 17, 2024	Red Teaming, Safety Alignment	Code Available	0
Automated Progressive Red Teaming	Jul 4, 2024	Active Learning, Red Teaming	Code Available	0
Aligners: Decoupling LLMs and Alignment	Mar 7, 2024	Instruction Following, Red Teaming	Code Available	0
We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems	Jun 16, 2025	Position, Red Teaming	Code Available	0
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding	Jun 17, 2024	16k, Language Modelling	Code Available	0
SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis	Oct 21, 2024	LLM Jailbreak, Red Teaming	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	SUDO	Attack Success Rate	41	—	Unverified