| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | Benchmarking, Instruction Following | Code Available | 1 |
| EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | May 22, 2025 | Benchmarking | Code Available | 1 |
| Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing | May 22, 2025 | Benchmarking | Unverified | 0 |
| BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | May 22, 2025 | Benchmarking | Unverified | 0 |
| Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models | May 22, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| CUB: Benchmarking Context Utilisation Techniques for Language Models | May 22, 2025 | Benchmarking, Fact Checking | Unverified | 0 |
| AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | May 22, 2025 | Benchmarking, Fairness | Code Available | 3 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | May 22, 2025 | Adversarial Attack, Adversarial Robustness | Unverified | 0 |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | May 22, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | Unverified | 0 |