| LLMBind: A Unified Modality-Task Integration Framework | Feb 22, 2024 | AI AgentAudio Generation | CodeCode Available | 1 | 5 |
| Do Large Language Model Benchmarks Test Reliability? | Feb 5, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Oct 28, 2024 | BenchmarkingLanguage Modeling | CodeCode Available | 1 | 5 |
| A Benchmark for Generalizing Across Diverse Team Strategies in Competitive Pokémon | Jun 12, 2025 | Large Language ModelStarcraft | CodeCode Available | 1 | 5 |
| LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations | Jan 23, 2024 | counterfactualFact Checking | CodeCode Available | 1 | 5 |
| Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents | Feb 27, 2025 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling | Mar 2, 2024 | Language ModellingLarge Language Model | CodeCode Available | 1 | 5 |
| DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | May 29, 2024 | DiversityLanguage Modeling | CodeCode Available | 1 | 5 |
| Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators | Mar 25, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |
| LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | Dec 9, 2024 | Language ModelingLanguage Modelling | CodeCode Available | 1 | 5 |