| Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Jun 3, 2024 | Red Teaming | CodeCode Available | 1 | 5 |
| Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Jun 17, 2024 | 16kLanguage Modelling | CodeCode Available | 0 | 5 |
| ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Oct 14, 2023 | Red Teaming | CodeCode Available | 0 | 5 |
| Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Apr 4, 2024 | Red Teaming | CodeCode Available | 0 | 5 |
| Red Teaming Language Models for Processing Contradictory Dialogues | May 16, 2024 | Red Teamingvalid | CodeCode Available | 0 | 5 |
| RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Jun 4, 2025 | Red Teaming | CodeCode Available | 0 | 5 |
| Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Dec 30, 2023 | Red Teaming | CodeCode Available | 0 | 5 |
| RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Jul 8, 2025 | Red Teaming | CodeCode Available | 0 | 5 |
| Capability-Based Scaling Laws for LLM Red-Teaming | May 26, 2025 | MMLUPrompt Engineering | CodeCode Available | 0 | 5 |
| An Auditing Test To Detect Behavioral Shift in Language Models | Oct 25, 2024 | BenchmarkingChange Detection | CodeCode Available | 0 | 5 |