| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | —Unverified | 0 |
| SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | May 21, 2025 | BenchmarkingCode Generation | —Unverified | 0 |
| NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | May 21, 2025 | BenchmarkingHallucination | —Unverified | 0 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | May 21, 2025 | BenchmarkingRAG | —Unverified | 0 |
| Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | May 21, 2025 | BenchmarkingDiagnostic | —Unverified | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | May 21, 2025 | BenchmarkingImitation Learning | —Unverified | 0 |
| VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | May 21, 2025 | Benchmarking | CodeCode Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 0 |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | May 21, 2025 | BenchmarkingChatbot | —Unverified | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | BenchmarkingMath | —Unverified | 0 |