| Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance | May 22, 2025 | Benchmarking, Prompt Engineering | —Unverified | 0 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | May 22, 2025 | Adversarial Attack, Adversarial Robustness | —Unverified | 0 |
| DailyQA: A Benchmark to Evaluate Web Retrieval Augmented LLMs Based on Capturing Real-World Changes | May 22, 2025 | Benchmarking, RAG | —Unverified | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | —Unverified | 0 |
| BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research | May 22, 2025 | Benchmarking | —Unverified | 0 |
| BAGELS: Benchmarking the Automated Generation and Extraction of Limitations from Scholarly Text | May 22, 2025 | Benchmarking, RAG | —Unverified | 0 |
| MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks | May 22, 2025 | Benchmarking, Spatial Reasoning | —Unverified | 0 |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | May 22, 2025 | Benchmarking, Language Modeling | —Unverified | 0 |
| CUB: Benchmarking Context Utilisation Techniques for Language Models | May 22, 2025 | Benchmarking, Fact Checking | —Unverified | 0 |
| Learning collective multi-cellular dynamics from temporal scRNA-seq via a transformer-enhanced Neural SDE | May 22, 2025 | Benchmarking, Time Series | Code Available | 0 |