| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs | Sep 18, 2021 | Benchmarking, Complex Query Answering | Code Available | 1 | 5 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 | 5 |
| Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus | Oct 18, 2022 | ARC, Benchmarking | Code Available | 1 | 5 |
| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Nov 29, 2023 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Nov 22, 2017 | Benchmarking, feature selection | Code Available | 1 | 5 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Jan 1, 2024 | Benchmarking, Instruction Following | Code Available | 1 | 5 |
| Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA | Dec 29, 2023 | Anatomy, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Robustness to Adversarial Image Obfuscations | Jan 30, 2023 | Benchmarking | Code Available | 1 | 5 |