| Contemporary Symbolic Regression Methods and their Relative Performance | Jul 29, 2021 | Benchmarking, parameter estimation | Code Available | 1 |
| Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Nov 30, 2023 | Benchmarking, OpenAI Gym | Code Available | 1 |
| LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Apr 11, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Jul 6, 2023 | Benchmarking, Domain Adaptation | Code Available | 1 |
| A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games | Aug 25, 2021 | Benchmarking, Video Classification | Code Available | 1 |
| Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Sep 29, 2023 | Benchmarking, Knowledge Graph Completion | Code Available | 1 |
| CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models | Nov 27, 2024 | Benchmarking, Earth Observation | Code Available | 1 |
| LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Oct 13, 2024 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| Benchmarking Image Retrieval for Visual Localization | Nov 24, 2020 | Autonomous Driving, Benchmarking | Code Available | 1 |
| ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Mar 26, 2024 | Benchmarking, Machine Reading Comprehension | Code Available | 1 |
| A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Apr 22, 2024 | Benchmarking, World Knowledge | Code Available | 1 |
| Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Sep 18, 2021 | Benchmarking, Complex Query Answering | Code Available | 1 |
| Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Dec 29, 2023 | Anatomy, Benchmarking | Code Available | 1 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Oct 21, 2024 | Benchmarking | Code Available | 1 |
| Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Dec 10, 2021 | Benchmarking | Code Available | 1 |
| ReMeDi: Resources for Multi-domain, Multi-service, Medical Dialogues | Sep 1, 2021 | Benchmarking, Contrastive Learning | Code Available | 1 |
| ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing via Compositional Dependencies | Jun 15, 2025 | Benchmarking | Code Available | 1 |
| Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Apr 25, 2024 | Benchmarking, object-detection | Code Available | 1 |
| Boosting Neural Image Compression for Machines Using Latent Space Masking | Dec 15, 2021 | Benchmarking, Image Compression | Code Available | 1 |
| Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets | Jan 29, 2024 | Benchmarking, Machine Translation | Code Available | 1 |
| Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection | May 30, 2022 | 3D Object Detection, Autonomous Driving | Code Available | 1 |
| MALPOLON: A Framework for Deep Species Distribution Modeling | Sep 26, 2024 | Benchmarking, GPU | Code Available | 1 |
| AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models | Jun 24, 2024 | Benchmarking, Data Augmentation | Code Available | 1 |
| High-Dimensional Inference in Bayesian Networks | Dec 16, 2021 | Benchmarking, Vocal Bursts Intensity Prediction | Code Available | 1 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |