| ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments | Feb 27, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Machine-learning for photoplethysmography analysis: Benchmarking feature, image, and signal-based approaches | Feb 27, 2025 | Benchmarking, Photoplethysmography (PPG) | Code Available | 0 |
| MMSciBench: Benchmarking Language Models on Multimodal Scientific Problems | Feb 27, 2025 | Benchmarking, Visual Reasoning | Unverified | 0 |
| LimeSoDa: A Dataset Collection for Benchmarking of Machine Learning Regressors in Digital Soil Mapping | Feb 27, 2025 | Benchmarking | Code Available | 0 |
| Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review | Feb 26, 2025 | Benchmarking, Text Detection | Unverified | 0 |
| Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | Feb 26, 2025 | Benchmarking, object-detection | Unverified | 0 |
| Modelling Regional Solar Photovoltaic Capacity in Great Britain | Feb 26, 2025 | Benchmarking | Unverified | 0 |
| Agentic Mixture-of-Workflows for Multi-Modal Chemical Search | Feb 26, 2025 | Benchmarking, Retrieval | Unverified | 0 |
| MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering | Feb 26, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors | Feb 26, 2025 | Benchmarking | Unverified | 0 |
| Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval | Feb 26, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning | Feb 25, 2025 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |
| Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers | Feb 25, 2025 | Articles, Benchmarking | Unverified | 0 |
| CayleyPy RL: Pathfinding and Reinforcement Learning on Cayley Graphs | Feb 25, 2025 | Benchmarking, reinforcement-learning | Unverified | 0 |
| A Real-time Spatio-Temporal Trajectory Planner for Autonomous Vehicles with Semantic Graph Optimization | Feb 25, 2025 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation | Feb 25, 2025 | Benchmarking, Semantic Segmentation | Unverified | 0 |
| MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering | Feb 24, 2025 | Benchmarking, Question Answering | Code Available | 0 |
| SynthRAD2025 Grand Challenge dataset: generating synthetic CTs for radiotherapy | Feb 24, 2025 | Benchmarking, Image Generation | Unverified | 0 |
| Enhancing Image Matting in Real-World Scenes with Mask-Guided Iterative Refinement | Feb 24, 2025 | Benchmarking, feature selection | Unverified | 0 |
| Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties | Feb 24, 2025 | Benchmarking | Code Available | 0 |
| Overconfident Oracles: Limitations of In Silico Sequence Design Benchmarking | Feb 24, 2025 | Benchmarking | Unverified | 0 |
| On Neural Inertial Classification Networks for Pedestrian Activity Recognition | Feb 23, 2025 | Activity Recognition, Benchmarking | Unverified | 0 |
| An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science | Feb 23, 2025 | Benchmarking, Code Generation | Code Available | 0 |
| VidLBEval: Benchmarking and Mitigating Language Bias in Video-Involved LVLMs | Feb 23, 2025 | Benchmarking | Unverified | 0 |
| VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models | Feb 23, 2025 | Benchmarking, Spatial Reasoning | Code Available | 0 |