| One ruler to measure them all: Benchmarking multilingual long-context language models | Mar 3, 2025 | 8k, All | Code Available | 1 |
| MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Mar 3, 2025 | Benchmarking | Code Available | 0 |
| AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Mar 3, 2025 | Benchmarking | Code Available | 1 |
| Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | Mar 3, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Mar 3, 2025 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | Mar 3, 2025 | Benchmarking, Spoken Dialogue Systems | Unverified | 0 |
| Multi-Agent Reinforcement Learning with Long-Term Performance Objectives for Service Workforce Optimization | Mar 3, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Delving into Out-of-Distribution Detection with Medical Vision-Language Models | Mar 2, 2025 | Benchmarking, image-classification | Code Available | 1 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| MAPS: Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Infrastructure | Mar 2, 2025 | Benchmarking | Unverified | 0 |
| Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks | Mar 2, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information | Mar 1, 2025 | Benchmarking | Unverified | 0 |
| LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation | Feb 28, 2025 | Articles, Benchmarking | Code Available | 1 |
| NeuroMorse: A Temporally Structured Dataset For Neuromorphic Computing | Feb 28, 2025 | Benchmarking | Code Available | 0 |
| ProBench: Benchmarking Large Language Models in Competitive Programming | Feb 28, 2025 | Attribute, Benchmarking | Unverified | 0 |
| Large Language Model-Based Benchmarking Experiment Settings for Evolutionary Multi-Objective Optimization | Feb 28, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice | Feb 28, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series | Feb 28, 2025 | Benchmarking, Solar Irradiance Forecasting | Unverified | 0 |
| Protein Structure Tokenization: Benchmarking and New Recipe | Feb 28, 2025 | Benchmarking, Language Modeling | Code Available | 1 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| EgoNormia: Benchmarking Physical Social Norm Understanding | Feb 27, 2025 | Answer Generation, Benchmarking | Code Available | 1 |
| OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection | Feb 27, 2025 | Action Detection, Benchmarking | Code Available | 3 |
| ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments | Feb 27, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches | Feb 27, 2025 | Benchmarking, Photoplethysmography (PPG) | Code Available | 0 |
| LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping | Feb 27, 2025 | Benchmarking | Code Available | 0 |