SOTAVerified

Red Teaming

Papers

Showing 181–190 of 251 papers

Title	Date	Tasks	Status	Hype
Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI	Mar 12, 2024	Hyperspectral Image Analysis, HYPERVIEW Challenge	Unverified	0
Defending Against Unforeseen Failure Modes with Latent Adversarial Training	Mar 8, 2024	Image Classification	Code Available	1
Aligners: Decoupling LLMs and Alignment	Mar 7, 2024	Instruction Following, Red Teaming	Code Available	0
A Safe Harbor for AI Evaluation and Red Teaming	Mar 7, 2024	Red Teaming	Unverified	0
Curiosity-driven Red-teaming for Large Language Models	Feb 29, 2024	Red Teaming, Reinforcement Learning (RL)	Code Available	2
AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning	Feb 21, 2024	Graph Neural Network, Red Teaming	Unverified	0
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation	Feb 14, 2024	Image Generation, Red Teaming	Code Available	1
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast	Feb 13, 2024	Language Modelling, Large Language Model	Code Available	2
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal	Feb 6, 2024	Red Teaming	Code Available	4
Investigating Bias Representations in Llama 2 Chat via Activation Steering	Feb 1, 2024	Decision Making, Red Teaming	Unverified	0

Benchmark Results

#	Model	Metric	Claimed	Verified	Status
1	SUDO	Attack Success Rate	41	—	Unverified