| M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection | May 16, 2025 | Benchmarkingobject-detection | CodeCode Available | 1 |
| MatTools: Benchmarking Large Language Models for Materials Science Tools | May 16, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 1 |
| Evaluating Robustness of Deep Reinforcement Learning for Autonomous Surface Vehicle Control in Field Tests | May 15, 2025 | BenchmarkingDeep Reinforcement Learning | CodeCode Available | 1 |
| Words That Unite The World: A Unified Framework for Deciphering Central Bank Communications Globally | May 15, 2025 | BenchmarkingSentence | CodeCode Available | 1 |
| Towards scalable surrogate models based on Neural Fields for large scale aerodynamic simulations | May 14, 2025 | Benchmarking | CodeCode Available | 1 |
| OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions | May 14, 2025 | Autonomous DrivingBenchmarking | CodeCode Available | 1 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | BenchmarkingMultiple-choice | CodeCode Available | 1 |
| FNBench: Benchmarking Robust Federated Learning against Noisy Labels | May 10, 2025 | BenchmarkingFederated Learning | CodeCode Available | 1 |
| JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | May 10, 2025 | BenchmarkingGPU | CodeCode Available | 1 |
| PyTDC: A multimodal machine learning training, evaluation, and inference platform for biomedical foundation models | May 8, 2025 | BenchmarkingGraph Representation Learning | CodeCode Available | 1 |
| scDrugMap: Benchmarking Large Foundation Models for Drug Response Prediction | May 8, 2025 | BenchmarkingDrug Discovery | CodeCode Available | 1 |
| Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | May 8, 2025 | BenchmarkingPrompt Engineering | CodeCode Available | 1 |
| RGB-Event Fusion with Self-Attention for Collision Prediction | May 7, 2025 | BenchmarkingComputational Efficiency | CodeCode Available | 1 |
| Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | May 7, 2025 | BenchmarkingHallucination | CodeCode Available | 1 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | CodeCode Available | 1 |
| CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | May 6, 2025 | Benchmarking | CodeCode Available | 1 |
| RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | May 4, 2025 | BenchmarkingQuestion Answering | CodeCode Available | 1 |
| GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation | Apr 30, 2025 | 3D Molecule GenerationBenchmarking | CodeCode Available | 1 |
| TrueFake: A Real World Case Dataset of Last Generation Fake Images also Shared on Social Networks | Apr 29, 2025 | BenchmarkingMisinformation | CodeCode Available | 1 |
| OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification | Apr 29, 2025 | BenchmarkingCode Generation | CodeCode Available | 1 |
| BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Apr 28, 2025 | Benchmarking | CodeCode Available | 1 |
| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | BenchmarkingMath | CodeCode Available | 1 |
| LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement | Apr 22, 2025 | BenchmarkingLanguage Modeling | CodeCode Available | 1 |
| TinyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming | Apr 14, 2025 | BenchmarkingProgram Synthesis | CodeCode Available | 1 |
| LEMUR Neural Network Dataset: Towards Seamless AutoML | Apr 14, 2025 | AutoMLBenchmarking | CodeCode Available | 1 |