| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | Benchmarking, Instruction Following | Code Available | 1 |
| Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control | May 22, 2025 | Benchmarking, Pansharpening | Code Available | 0 |
| When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | May 22, 2025 | Benchmarking | Unverified | 0 |
| Experimental robustness benchmark of quantum neural network on a superconducting quantum processor | May 22, 2025 | Adversarial Attack, Adversarial Robustness | Unverified | 0 |
| Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | May 22, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models | May 22, 2025 | Benchmarking, Fairness | Code Available | 3 |
| SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation | May 21, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction | May 21, 2025 | Benchmarking, Hallucination | Unverified | 0 |
| Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets | May 21, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation | May 21, 2025 | Benchmarking, RAG | Unverified | 0 |
| VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models | May 21, 2025 | Benchmarking | Code Available | 0 |
| Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals | May 21, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking | May 21, 2025 | Benchmarking, Claim Verification | Code Available | 0 |
| UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | May 21, 2025 | Benchmarking, Imitation Learning | Unverified | 0 |
| Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems | May 21, 2025 | Benchmarking, Math | Unverified | 0 |
| Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | May 21, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| Benchmarking Energy and Latency in TinyML: A Novel Method for Resource-Constrained AI | May 21, 2025 | Benchmarking | Unverified | 0 |
| Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models | May 21, 2025 | Benchmarking, Prompt Engineering | Unverified | 0 |
| VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models | May 21, 2025 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory | May 21, 2025 | Benchmarking, Language Modeling | Code Available | 0 |
| Oral Imaging for Malocclusion Issues Assessments: OMNI Dataset, Deep Learning Baselines and Benchmarking | May 21, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| A Risk Taxonomy for Evaluating AI-Powered Psychotherapy Agents | May 21, 2025 | Benchmarking, Decompensation | Unverified | 0 |
| Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks | May 21, 2025 | Benchmarking, GPU | Unverified | 0 |
| DECASTE: Unveiling Caste Stereotypes in Large Language Models through Multi-Dimensional Bias Analysis | May 20, 2025 | Benchmarking, Fairness | Unverified | 0 |