SOTAVerified

Red Teaming

Papers

Showing 151–200 of 251 papers

Title | Status | Hype
RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models |  | 0
OpenAI o1 System Card |  | 0
Can Language Models be Instructed to Protect Personal Information? |  | 0
The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward |  | 0
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback |  | 0
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle |  | 0
Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models |  | 0
POEX: Understanding and Mitigating Policy Executable Jailbreak Attacks against Embodied AI |  | 0
Predictive Red Teaming: Breaking Policies Without Breaking Robots |  | 0
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models |  | 0
Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems |  | 0
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming |  | 0
Purple-teaming LLMs with Adversarial Defender Training |  | 0
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models |  | 0
Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models |  | 0
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration |  | 0
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines |  | 0
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models |  | 0
Automating Privilege Escalation with Deep Reinforcement Learning |  | 0
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations |  | 0
RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent |  | 0
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming |  | 0
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester |  | 0
Towards medical AI misalignment: a preliminary study |  | 0
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code |  | 0
Red Teaming AI Policy: A Taxonomy of Avoision and the EU AI Act |  | 0
Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives |  | 0
Red-Teaming for Generative AI: Silver Bullet or Security Theater? |  | 0
Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework |  | 0
Red Teaming Generative AI/NLP, the BB84 quantum cryptography protocol and the NIST-approved Quantum-Resistant Cryptographic Algorithms |  | 0
Towards Red Teaming in Multimodal and Multilingual Translation |  | 0
AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning |  | 0
Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges |  | 0
Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI |  | 0
A Safe Harbor for AI Evaluation and Red Teaming |  | 0
Red Teaming Large Language Models for Healthcare |  | 0
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts |  | 0
Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI |  | 0
A Framework for Evaluating Emerging Cyberattack Capabilities of AI |  | 0
Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling |  | 0
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs |  | 0
Red-Teaming the Stable Diffusion Safety Filter |  | 0
Red Teaming Visual Language Models |  | 0
Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review |  | 0
A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming |  | 0
Reinforced Diffuser for Red Teaming Large Vision-Language Models |  | 0
A Red Teaming Roadmap Towards System-Level Safety |  | 0
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents |  | 0
A Red Teaming Framework for Securing AI in Maritime Autonomous Systems |  | 0
RRTL: Red Teaming Reasoning Large Language Models in Tool Learning |  | 0

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 |  | Unverified