SOTAVerified

Red Teaming

Papers

Showing 201–225 of 251 papers

Title | Status | Hype
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | — | 0
The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | — | 0
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | — | 0
The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward | — | 0
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | — | 0
Towards medical AI misalignment: a preliminary study | — | 0
Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework | — | 0
Towards Red Teaming in Multimodal and Multilingual Translation | — | 0
Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges | — | 0
Understanding and Mitigating Risks of Generative AI in Financial Services | — | 0
VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | — | 0
When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | — | 0
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Code | 0
RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Code | 0
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Code | 0
What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Code | 0
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Code | 0
Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Code | 0
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Code | 0
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Code | 0
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Code | 0
Red Teaming Language Models for Processing Contradictory Dialogues | Code | 0
RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Code | 0
Overriding Safety protections of Open-source Models | Code | 0
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents | Code | 0
Page 9 of 11

Benchmark Results

# | Model | Metric | Claimed | Verified | Status
1 | SUDO | Attack Success Rate | 41 | — | Unverified