| Benchmarking Image Retrieval for Visual Localization | Nov 24, 2020 | Autonomous Driving, Benchmarking | Code Available | 1 | 5 |
| Quantitative Certification of Bias in Large Language Models | May 29, 2024 | Benchmarking | Code Available | 1 | 5 |
| ArabicaQA: A Comprehensive Dataset for Arabic Question Answering | Mar 26, 2024 | Benchmarking, Machine Reading Comprehension | Code Available | 1 | 5 |
| EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | Dec 11, 2023 | Benchmarking, Human-Object Interaction Detection | Code Available | 1 | 5 |
| Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Oct 17, 2024 | All, Benchmarking | Code Available | 1 | 5 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Oct 31, 2024 | Benchmarking, Cloud Removal | Code Available | 1 | 5 |
| Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems | Jun 28, 2021 | 3D Reconstruction, Benchmarking | Code Available | 1 | 5 |
| Benchmarking tree species classification from proximally-sensed laser scanning data: introducing the FOR-species20K dataset | Aug 12, 2024 | Benchmarking | Code Available | 1 | 5 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Oct 13, 2023 | Benchmarking, Fairness | Code Available | 1 | 5 |
| EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Oct 31, 2024 | Benchmarking, Electromyography (EMG) | Code Available | 1 | 5 |
| Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets | Dec 10, 2021 | Benchmarking | Code Available | 1 | 5 |
| KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | May 25, 2023 | Benchmarking, Face Recognition | Code Available | 1 | 5 |
| EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Nov 1, 2023 | Benchmarking, Cryogenic Electron Microscopy (cryo-EM) | Code Available | 1 | 5 |
| KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi | Oct 23, 2020 | Articles, Benchmarking | Code Available | 1 | 5 |
| Beyond neural scaling laws: beating power law scaling via data pruning | Jun 29, 2022 | Benchmarking | Code Available | 1 | 5 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 | 5 |
| Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Dec 15, 2023 | Benchmarking, Code Summarization | Code Available | 1 | 5 |
| Beyond Normal: On the Evaluation of Mutual Information Estimators | Jun 19, 2023 | Benchmarking, Domain Generalization | Code Available | 1 | 5 |
| Enhancing Biomedical Relation Extraction with Directionality | Jan 23, 2025 | Benchmarking, Document-level Relation Extraction | Code Available | 1 | 5 |
| Enhancing Ligand Pose Sampling for Molecular Docking | Nov 30, 2023 | Benchmarking, Molecular Docking | Code Available | 1 | 5 |
| scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data | Jun 10, 2025 | Benchmarking, Data Augmentation | Code Available | 1 | 5 |
| Just Rank: Rethinking Evaluation with Word and Sentence Similarities | Mar 5, 2022 | Benchmarking, Semantic Similarity | Code Available | 1 | 5 |
| Autonomous Reinforcement Learning: Formalism and Benchmarking | Dec 17, 2021 | Benchmarking, reinforcement-learning | Code Available | 1 | 5 |
| Labelling unlabelled videos from scratch with multi-modal self-supervision | Jun 24, 2020 | Benchmarking, Clustering | Code Available | 1 | 5 |
| JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching | Feb 5, 2024 | Benchmarking, Sentence | Code Available | 1 | 5 |
| ENRICH: Multi-purposE dataset for beNchmaRking In Computer vision and pHotogrammetry | Apr 1, 2023 | 3D Reconstruction, 3D Scene Reconstruction | Code Available | 1 | 5 |
| Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | May 8, 2025 | Benchmarking, Prompt Engineering | Code Available | 1 | 5 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Nov 4, 2024 | Action Generation, Benchmarking | Code Available | 1 | 5 |
| Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking | Jun 17, 2024 | Benchmarking, Demand Forecasting | Code Available | 1 | 5 |
| SHARP: Environment and Person Independent Activity Recognition with Commodity IEEE 802.11 Access Points | Mar 17, 2021 | Activity Recognition, Benchmarking | Code Available | 1 | 5 |
| JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning | Jul 21, 2023 | Benchmarking, Combinatorial Optimization | Code Available | 1 | 5 |
| A Critical Assessment of State-of-the-Art in Entity Alignment | Oct 30, 2020 | Benchmarking, Entity Alignment | Code Available | 1 | 5 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Nov 5, 2024 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems | Mar 19, 2024 | Benchmarking, feature selection | Code Available | 1 | 5 |
| Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Jul 16, 2024 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | May 10, 2025 | Benchmarking, GPU | Code Available | 1 | 5 |
| Jojajovai: A Parallel Guarani-Spanish Corpus for MT Benchmarking | Jun 1, 2022 | Benchmarking, Sentence | Code Available | 1 | 5 |
| Best practices for constructing, preparing, and evaluating protein-ligand binding affinity benchmarks | May 13, 2021 | Benchmarking, Drug Discovery | Code Available | 1 | 5 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Jun 15, 2023 | Benchmarking, Label Error Detection | Code Available | 1 | 5 |
| APTv2: Benchmarking Animal Pose Estimation and Tracking with a Large-scale Dataset and Beyond | Dec 25, 2023 | Animal Pose Estimation, Benchmarking | Code Available | 1 | 5 |
| Evaluating histopathology transfer learning with ChampKit | Jun 14, 2022 | Benchmarking, Cell Detection | Code Available | 1 | 5 |
| Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking | Jun 18, 2023 | Benchmarking, Link Prediction | Code Available | 1 | 5 |
| BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models | Jun 2, 2023 | Benchmarking, Language Acquisition | Code Available | 1 | 5 |
| Evaluating Multimodal Representations on Visual Semantic Textual Similarity | Apr 4, 2020 | Benchmarking, Image Captioning | Code Available | 1 | 5 |
| ISSAFE: Improving Semantic Segmentation in Accidents by Fusing Event-based Data | Aug 20, 2020 | Autonomous Vehicles, Benchmarking | Code Available | 1 | 5 |
| Rethinking Machine Unlearning in Image Generation Models | Jun 3, 2025 | Benchmarking, Image Generation | Code Available | 1 | 5 |
| JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds | Nov 5, 2023 | Autonomous Navigation, Autonomous Vehicles | Code Available | 1 | 5 |
| Benchmark on Drug Target Interaction Modeling from a Structure Perspective | Jul 4, 2024 | Benchmarking, Drug Discovery | Code Available | 1 | 5 |
| ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Jul 26, 2024 | Benchmarking, Model Selection | Code Available | 1 | 5 |
| Benchpress: A Scalable and Versatile Workflow for Benchmarking Structure Learning Algorithms | Jul 8, 2021 | Benchmarking | Code Available | 1 | 5 |