| Title | Date | Topics |
| --- | --- | --- |
| RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Nov 16, 2023 | Backdoor Attack, Data Poisoning |
| OpenAI o1 System Card | Dec 21, 2024 | Management, Red Teaming |
| Can Language Models be Instructed to Protect Personal Information? | Oct 3, 2023 | Adversarial Robustness, Red Teaming |
| The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward | Aug 28, 2023 | Ethics, Philosophy |
| Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback | Mar 9, 2023 | Red Teaming |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Jul 18, 2024 | Benchmarking, Language Modeling |
| Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models | Jan 14, 2025 | Red Teaming |
| POEX: Understanding and Mitigating Policy Executable Jailbreak Attacks against Embodied AI | Dec 21, 2024 | LLM Jailbreak, Red Teaming |
| Predictive Red Teaming: Breaking Policies Without Breaking Robots | Feb 10, 2025 | Imitation Learning, Red Teaming |
| Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | Mar 3, 2025 | Red Teaming, Survey |