| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Jun 10, 2025 | Benchmarking | CodeCode Available | 1 |
| dEchorate: a Calibrated Room Impulse Response Database for Echo-aware Signal Processing | Apr 27, 2021 | BenchmarkingRetrieval | CodeCode Available | 1 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | BenchmarkingRetrieval | CodeCode Available | 1 |
| Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System | Nov 1, 2021 | BenchmarkingResponse Generation | CodeCode Available | 1 |
| Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Feb 27, 2024 | Benchmarking | CodeCode Available | 1 |
| Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark | Jun 8, 2022 | BenchmarkingExplainable Artificial Intelligence (XAI) | CodeCode Available | 1 |
| CodeS: Natural Language to Code Repository via Multi-Layer Sketch | Mar 25, 2024 | Benchmarking | CodeCode Available | 1 |
| A Closer Look at Mortality Risk Prediction from Electrocardiograms | Jun 24, 2024 | BenchmarkingPrediction | CodeCode Available | 1 |
| Benchmarking MRI Reconstruction Neural Networks on Large Public Datasets | Mar 6, 2020 | BenchmarkingImage Reconstruction | CodeCode Available | 1 |
| A global analysis of metrics used for measuring performance in natural language processing | Apr 25, 2022 | BenchmarkingMachine Translation | CodeCode Available | 1 |
| A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Mar 31, 2023 | BenchmarkingCausal Discovery | CodeCode Available | 1 |
| Benchmarking Large Language Models for News Summarization | Jan 31, 2023 | BenchmarkingNews Summarization | CodeCode Available | 1 |
| A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Apr 26, 2020 | BenchmarkingLeft Atrium Segmentation | CodeCode Available | 1 |
| Benchmarking Multidomain English-Indonesian Machine Translation | May 1, 2020 | BenchmarkingMachine Translation | CodeCode Available | 1 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Jun 1, 2025 | Benchmarking | CodeCode Available | 1 |
| CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking | Jan 22, 2020 | Benchmarkingobject-detection | CodeCode Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarkingknowledge editing | CodeCode Available | 1 |
| CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework | Dec 7, 2022 | Benchmarking | CodeCode Available | 1 |
| COCO: The Large Scale Black-Box Optimization Benchmarking (bbob-largescale) Test Suite | Mar 15, 2019 | Benchmarking | CodeCode Available | 1 |
| Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models | May 26, 2025 | BenchmarkingRAG | CodeCode Available | 1 |
| A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Aug 10, 2023 | BenchmarkingDecision Making | CodeCode Available | 1 |
| EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset | Oct 11, 2021 | BenchmarkingFace Hallucination | CodeCode Available | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Apr 6, 2025 | BenchmarkingCombinatorial Optimization | CodeCode Available | 1 |
| Benchmarking and scaling of deep learning models for land cover image classification | Nov 18, 2021 | BenchmarkingClassification | CodeCode Available | 1 |
| Codabench: Flexible, Easy-to-Use and Reproducible Benchmarking Platform | Oct 12, 2021 | Benchmarking | CodeCode Available | 1 |