| A2Perf: Real-World Autonomous Agents Benchmark | Mar 4, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Technical report of a DMD-based Characterization Method for Vision Sensors | Mar 4, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Mar 3, 2025 | Benchmarking | Code Available | 0 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | Mar 3, 2025 | Benchmarking, Spoken Dialogue Systems | Unverified | 0 |
| Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | Mar 3, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| Multi-Agent Reinforcement Learning with Long-Term Performance Objectives for Service Workforce Optimization | Mar 3, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| MAPS: Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Infrastructure | Mar 2, 2025 | Benchmarking | Unverified | 0 |
| Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks | Mar 2, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information | Mar 1, 2025 | Benchmarking | Unverified | 0 |