| Benchmarking Reasoning Robustness in Large Language Models | Mar 6, 2025 | Benchmarking, Math | Unverified | 0 |
| Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Mar 6, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | Mar 6, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| Eventprop training for efficient neuromorphic applications | Mar 6, 2025 | Benchmarking, GPU | Unverified | 0 |
| Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems | Mar 5, 2025 | Benchmarking, CPU | Code Available | 0 |
| Towards Universal Learning-based Model for Cardiac Image Reconstruction: Summary of the CMRxRecon2024 Challenge | Mar 5, 2025 | Benchmarking, Image Reconstruction | Unverified | 0 |
| AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks | Mar 5, 2025 | Benchmarking, graph construction | Code Available | 0 |
| GNNMerge: Merging of GNN Models Without Accessing Training Data | Mar 5, 2025 | Benchmarking, Computational Efficiency | Code Available | 0 |
| Technical report of a DMD-based Characterization Method for Vision Sensors | Mar 4, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| Evaluation of Architectural Synthesis Using Generative AI | Mar 4, 2025 | Benchmarking | Unverified | 0 |
| A2Perf: Real-World Autonomous Agents Benchmark | Mar 4, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Optimizing open-domain question answering with graph-based retrieval augmented generation | Mar 4, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | Mar 3, 2025 | Benchmarking, Spoken Dialogue Systems | Unverified | 0 |
| MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages | Mar 3, 2025 | Benchmarking | Code Available | 0 |
| Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models | Mar 3, 2025 | Benchmarking, Information Retrieval | Unverified | 0 |
| Multi-Agent Reinforcement Learning with Long-Term Performance Objectives for Service Workforce Optimization | Mar 3, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| FunBench: Benchmarking Fundus Reading Skills of MLLMs | Mar 2, 2025 | Anatomy, Benchmarking | Unverified | 0 |
| MAPS: Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Infrastructure | Mar 2, 2025 | Benchmarking | Unverified | 0 |
| Towards Efficient Educational Chatbots: Benchmarking RAG Frameworks | Mar 2, 2025 | Benchmarking, Chatbot | Unverified | 0 |
| A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information | Mar 1, 2025 | Benchmarking | Unverified | 0 |
| Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series | Feb 28, 2025 | Benchmarking, Solar Irradiance Forecasting | Unverified | 0 |
| Large Language Model-Based Benchmarking Experiment Settings for Evolutionary Multi-Objective Optimization | Feb 28, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| NeuroMorse: A Temporally Structured Dataset For Neuromorphic Computing | Feb 28, 2025 | Benchmarking | Code Available | 0 |
| ProBench: Benchmarking Large Language Models in Competitive Programming | Feb 28, 2025 | Attribute, Benchmarking | Unverified | 0 |
| PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice | Feb 28, 2025 | Benchmarking, Diagnostic | Unverified | 0 |
| ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments | Feb 27, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches | Feb 27, 2025 | Benchmarking, Photoplethysmography (PPG) | Code Available | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping | Feb 27, 2025 | Benchmarking | Code Available | 0 |
| Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review | Feb 26, 2025 | Benchmarking, Text Detection | Unverified | 0 |
| Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | Feb 26, 2025 | Benchmarking, object-detection | Unverified | 0 |
| Modelling Regional Solar Photovoltaic Capacity in Great Britain | Feb 26, 2025 | Benchmarking | Unverified | 0 |
| Agentic Mixture-of-Workflows for Multi-Modal Chemical Search | Feb 26, 2025 | Benchmarking, Retrieval | Unverified | 0 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | Feb 26, 2025 | Benchmarking | Unverified | 0 |
| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Feb 25, 2025 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |
| Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers | Feb 25, 2025 | Articles, Benchmarking | Unverified | 0 |
| CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs | Feb 25, 2025 | Benchmarking, reinforcement-learning | Unverified | 0 |
| A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization | Feb 25, 2025 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation | Feb 25, 2025 | Benchmarking, Semantic Segmentation | Unverified | 0 |
| MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering | Feb 24, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy | Feb 24, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement | Feb 24, 2025 | Benchmarking, feature selection | Unverified | 0 |
| Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties | Feb 24, 2025 | Benchmarking | Code Available | 0 |
| Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking | Feb 24, 2025 | Benchmarking | Unverified | 0 |
| On Neural Inertial Classification Networks for Pedestrian Activity Recognition | Feb 23, 2025 | Activity Recognition, Benchmarking | Unverified | 0 |
| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Feb 23, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | Feb 23, 2025 | Benchmarking | Unverified | 0 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Feb 23, 2025 | Benchmarking, Spatial Reasoning | Code Available | 0 |