SOTAVerified

Red Teaming

Papers

Showing 151–200 of 251 papers

| Title | Status | Hype |
| --- | --- | --- |
| ELAB: Extensive LLM Alignment Benchmark in Persian Language | | 0 |
| Embodied Red Teaming for Auditing Robotic Foundation Models | | 0 |
| EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection | | 0 |
| Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity | | 0 |
| Exploring Straightforward Conversational Red-Teaming | | 0 |
| Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | | 0 |
| Fast Proxies for LLM Robustness Evaluation | | 0 |
| Finding Safety Neurons in Large Language Models | | 0 |
| FLIRT: Feedback Loop In-context Red Teaming | | 0 |
| Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | | 0 |
| GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | | 0 |
| h4rm3l: A language for Composable Jailbreak Attack Synthesis | | 0 |
| "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | | 0 |
| Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | | 0 |
| HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | | 0 |
| In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | | 0 |
| Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | | 0 |
| Investigating Bias Representations in Llama 2 Chat via Activation Steering | | 0 |
| IterAlign: Iterative Constitutional Alignment of Large Language Models | | 0 |
| JAB: Joint Adversarial Prompting and Belief Augmentation | | 0 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | | 0 |
| Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | | 0 |
| Jailbreaking Large Language Models with Symbolic Mathematics | | 0 |
| Red Teaming AI Policy: A Taxonomy of Avoision and the EU AI Act | | 0 |
| Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives | | 0 |
| Red-Teaming for Generative AI: Silver Bullet or Security Theater? | | 0 |
| Red Teaming Generative AI/NLP, the BB84 quantum cryptography protocol and the NIST-approved Quantum-Resistant Cryptographic Algorithms | | 0 |
| Red Teaming Large Language Models for Healthcare | | 0 |
| Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI | | 0 |
| Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling | | 0 |
| Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs | | 0 |
| Red-Teaming the Stable Diffusion Safety Filter | | 0 |
| Red Teaming Visual Language Models | | 0 |
| Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review | | 0 |
| Reinforced Diffuser for Red Teaming Large Vision-Language Models | | 0 |
| RRTL: Red Teaming Reasoning Large Language Models in Tool Learning | | 0 |
| Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | | 0 |
| SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | | 0 |
| Safety Alignment for Vision Language Models | | 0 |
| Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | | 0 |
| SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | | 0 |
| Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses | | 0 |
| Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model | | 0 |
| STACK: Adversarial Attacks on LLM Safeguard Pipelines | | 0 |
| STAR: SocioTechnical Approach to Red Teaming Language Models | | 0 |
| SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | | 0 |
| Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | | 0 |
| Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming | | 0 |
| Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | | 0 |
| Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | | 0 |

Benchmark Results

| # | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| 1 | SUDO | Attack Success Rate | 41 | | Unverified |