| Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy | Oct 23, 2020 | Benchmarking, Diagnostic | Code Available | 1 | 5 |
| Just Rank: Rethinking Evaluation with Word and Sentence Similarities | Mar 5, 2022 | Benchmarking, Semantic Similarity | Code Available | 1 | 5 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | Code Available | 1 | 5 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Oct 13, 2023 | Benchmarking, Fairness | Code Available | 1 | 5 |
| Beyond neural scaling laws: beating power law scaling via data pruning | Jun 29, 2022 | Benchmarking | Code Available | 1 | 5 |
| Bag of Tricks for Adversarial Training | Oct 1, 2020 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |
| BEND: Benchmarking DNA Language Models on biologically meaningful tasks | Nov 21, 2023 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| Leveraging Trust for Joint Multi-Objective and Multi-Fidelity Optimization | Dec 27, 2021 | Bayesian Optimization, Benchmarking | Code Available | 1 | 5 |
| Beyond Normal: On the Evaluation of Mutual Information Estimators | Jun 19, 2023 | Benchmarking, Domain Generalization | Code Available | 1 | 5 |
| Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging | Apr 22, 2024 | Benchmarking | Code Available | 1 | 5 |
| KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration | May 25, 2023 | Benchmarking, Face Recognition | Code Available | 1 | 5 |
| RobustBench: a standardized adversarial robustness benchmark | Oct 19, 2020 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Graph Neural Networks on Dynamic Link Prediction | Sep 29, 2021 | Benchmarking, Dynamic Link Prediction | Code Available | 1 | 5 |
| Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages | Mar 11, 2024 | Benchmarking, Data Augmentation | Code Available | 1 | 5 |
| Benchmarking Graph Neural Networks for FMRI analysis | Nov 16, 2022 | Benchmarking | Code Available | 1 | 5 |
| Exploring Large Language Models for Classical Philology | May 23, 2023 | Benchmarking, Decoder | Code Available | 1 | 5 |
| EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions | Jun 8, 2021 | Bayesian Optimisation, Benchmarking | Code Available | 1 | 5 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Apr 6, 2025 | Benchmarking, Combinatorial Optimization | Code Available | 1 | 5 |
| Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models | Jul 16, 2024 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts | Nov 7, 2023 | Benchmarking, Machine Translation | Code Available | 1 | 5 |
| RobFR: Benchmarking Adversarial Robustness on Face Recognition | Jul 8, 2020 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |
| Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems | Jun 28, 2021 | 3D Reconstruction, Benchmarking | Code Available | 1 | 5 |
| MatTools: Benchmarking Large Language Models for Materials Science Tools | May 16, 2025 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Jun 15, 2023 | Benchmarking, Fairness | Code Available | 1 | 5 |
| Benchmarking Knowledge-driven Zero-shot Learning | Jun 29, 2021 | Attribute, Benchmarking | Code Available | 1 | 5 |