| Title | Date | Tags | Code | # |
| --- | --- | --- | --- | --- |
| QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Feb 13, 2025 | Safety Alignment | Code Available | 1 |
| Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Feb 6, 2025 | Safety Alignment | Code Available | 1 |
| Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation | Jan 30, 2025 | Safety Alignment | Code Available | 1 |
| xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking | Jan 28, 2025 | Reinforcement Learning (RL), Safety Alignment | Code Available | 1 |
| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Dec 7, 2024 | Red Teaming, Safety Alignment | Code Available | 1 |
| Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Nov 27, 2024 | Safety Alignment, Visual Reasoning | Code Available | 1 |
| SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Oct 29, 2024 | Language Modeling | Code Available | 1 |
| Bayesian scaling laws for in-context learning | Oct 21, 2024 | In-Context Learning, Safety Alignment | Code Available | 1 |
| Locking Down the Finetuned LLMs Safety | Oct 14, 2024 | Safety Alignment | Code Available | 1 |