SOTAVerified

Red Teaming

Papers

Showing 151–175 of 251 papers

RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models
OpenAI o1 System Card
Can Language Models be Instructed to Protect Personal Information?
The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models
POEX: Understanding and Mitigating Policy Executable Jailbreak Attacks against Embodied AI
Predictive Red Teaming: Breaking Policies Without Breaking Robots
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
Purple-teaming LLMs with Adversarial Defender Training
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Automating Privilege Escalation with Deep Reinforcement Learning
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Towards medical AI misalignment: a preliminary study
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified