| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | Benchmarking, Image Captioning | Code Available | 1 | 5 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| A Comparative Attention Framework for Better Few-Shot Object Detection on Aerial Images | Oct 25, 2022 | Benchmarking, Few-Shot Object Detection | Code Available | 1 | 5 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Jan 1, 2024 | Benchmarking, Instruction Following | Code Available | 1 | 5 |
| Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Jul 5, 2024 | Benchmarking, valid | Code Available | 1 | 5 |
| Beyond neural scaling laws: beating power law scaling via data pruning | Jun 29, 2022 | Benchmarking | Code Available | 1 | 5 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| A Closer Look at Mortality Risk Prediction from Electrocardiograms | Jun 24, 2024 | Benchmarking, Prediction | Code Available | 1 | 5 |
| HINT3: Raising the bar for Intent Detection in the Wild | Sep 29, 2020 | Benchmarking, Intent Detection | Code Available | 1 | 5 |
| A global analysis of metrics used for measuring performance in natural language processing | Apr 25, 2022 | Benchmarking, Machine Translation | Code Available | 1 | 5 |
| A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Mar 31, 2023 | Benchmarking, Causal Discovery | Code Available | 1 | 5 |
| BiBench: Benchmarking and Analyzing Network Binarization | Jan 26, 2023 | Benchmarking, Binarization | Code Available | 1 | 5 |
| A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Apr 26, 2020 | Benchmarking, Left Atrium Segmentation | Code Available | 1 | 5 |
| Benchmarking Multidomain English-Indonesian Machine Translation | May 1, 2020 | Benchmarking, Machine Translation | Code Available | 1 | 5 |
| Automatic Detection of Generated Text is Easiest when Humans are Fooled | Nov 2, 2019 | Benchmarking, Language Modelling | Code Available | 1 | 5 |
| RGB-D Indiscernible Object Counting in Underwater Scenes | Apr 23, 2023 | Benchmarking, Depth Estimation | Code Available | 1 | 5 |
| Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Feb 27, 2024 | Benchmarking | Code Available | 1 | 5 |
| GRecX: An Efficient and Unified Benchmark for GNN-based Recommendation | Nov 19, 2021 | Benchmarking, Management | Code Available | 1 | 5 |
| Benchmarking Large Language Models for News Summarization | Jan 31, 2023 | Benchmarking, News Summarization | Code Available | 1 | 5 |
| Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus | Oct 18, 2022 | ARC, Benchmarking | Code Available | 1 | 5 |
| GraphWorld: Fake Graphs Bring Real Insights for GNNs | Feb 28, 2022 | Benchmarking | Code Available | 1 | 5 |
| Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models | Dec 15, 2023 | Benchmarking, Code Summarization | Code Available | 1 | 5 |
| GraphGallery: A Platform for Fast Benchmarking and Easy Development of Graph Neural Networks Based Intelligent Software | Feb 16, 2021 | Benchmarking | Code Available | 1 | 5 |
| Biomedical Data-to-Text Generation via Fine-Tuning Transformers | Sep 3, 2021 | Benchmarking, Data-to-Text Generation | Code Available | 1 | 5 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Jun 15, 2024 | Benchmarking, GPU | Code Available | 1 | 5 |
| Graph Neural Network-Based Anomaly Detection for River Network Systems | Apr 19, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | May 28, 2024 | Benchmarking, Feature Engineering | Code Available | 1 | 5 |
| BLADE: Benchmarking Language Model Agents for Data-Driven Science | Aug 19, 2024 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| Benchmarking Simulation-Based Inference | Jan 12, 2021 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Visual Localization for Autonomous Navigation | Mar 24, 2022 | Autonomous Navigation, Benchmarking | Code Available | 1 | 5 |
| A skeletonization algorithm for gradient-based optimization | Sep 5, 2023 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Oct 22, 2024 | Benchmarking | Code Available | 1 | 5 |
| AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Mar 3, 2025 | Benchmarking | Code Available | 1 | 5 |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | May 27, 2022 | Benchmarking, Few-Shot Image Classification | Code Available | 1 | 5 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| Boosting Neural Image Compression for Machines Using Latent Space Masking | Dec 15, 2021 | Benchmarking, Image Compression | Code Available | 1 | 5 |
| GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Jun 29, 2024 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| Graph Robustness Benchmark: Benchmarking the Adversarial Robustness of Graph Machine Learning | Nov 8, 2021 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |
| BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text | Apr 28, 2025 | Benchmarking | Code Available | 1 | 5 |
| Grounding Descriptions in Images informs Zero-Shot Visual Recognition | Dec 5, 2024 | Attribute, Benchmarking | Code Available | 1 | 5 |
| AI Accelerator Survey and Trends | Sep 18, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset | Jun 14, 2022 | Benchmarking, Ischemic Stroke Lesion Segmentation | Code Available | 1 | 5 |
| Benchmarking Neural Network Generalization for Grammar Induction | Aug 16, 2023 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations | Jul 4, 2018 | Adversarial Defense, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Mar 2, 2024 | Attribute, Benchmarking | Code Available | 1 | 5 |
| Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Jun 16, 2023 | Benchmarking, Evidence Selection | Code Available | 1 | 5 |
| Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Apr 3, 2024 | Benchmarking, General Knowledge | Code Available | 1 | 5 |
| GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Oct 3, 2023 | Benchmarking, counterfactual | Code Available | 1 | 5 |
| GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking | May 28, 2025 | Benchmarking, Text Spotting | Code Available | 1 | 5 |
| GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models | Jul 3, 2024 | Benchmarking | Code Available | 1 | 5 |