SOTAVerified

Red Teaming

Papers

Showing 221–230 of 251 papers

Title	Date	Tasks	Status	Hype
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates	Jun 4, 2025	Red Teaming	Code Available	0
Red Teaming Language Models for Processing Contradictory Dialogues	May 16, 2024	Red Teaming, valid	Code Available	0
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages	Jul 8, 2025	Red Teaming	Code Available	0
Overriding Safety protections of Open-source Models	Sep 28, 2024	Red Teaming, Safety Alignment	Code Available	0
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents	Sep 5, 2022	Red Teaming, reinforcement-learning	Code Available	0
No Offense Taken: Eliciting Offensiveness from Language Models	Oct 2, 2023	Diversity, Red Teaming	Code Available	0
Steering Without Side Effects: Improving Post-Deployment Control of Language Models	Jun 21, 2024	Red Teaming, TruthfulQA	Code Available	0
Red-Teaming Segment Anything Model	Apr 2, 2024	Image Segmentation, model	Code Available	0
Bias patterns in the application of LLMs for clinical decision support: A comprehensive study	Apr 23, 2024	Decision Making, Question Answering	Code Available	0
Capability-Based Scaling Laws for LLM Red-Teaming	May 26, 2025	MMLU, Prompt Engineering	Code Available	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	SUDO	Attack Success Rate	41	—	Unverified