| VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Oct 12, 2024 | Diversity, Hallucination | Unverified | 0 |
| When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | Apr 29, 2025 | Red Teaming | Unverified | 0 |
| SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Jul 2, 2024 | Red Teaming, Safety Alignment | Code Available | 0 |
| RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Jun 4, 2025 | Red Teaming | Code Available | 0 |
| Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Jan 19, 2024 | Model Editing, Red Teaming | Code Available | 0 |
| What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Sep 14, 2024 | Red Teaming | Code Available | 0 |
| Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Aug 27, 2024 | Red Teaming, Transfer Learning | Code Available | 0 |
| Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Dec 30, 2023 | Red Teaming | Code Available | 0 |
| Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Nov 15, 2023 | Red Teaming | Code Available | 0 |
| Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Apr 4, 2024 | Red Teaming | Code Available | 0 |