| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Mar 2, 2024 | Attribute, Benchmarking | Code Available | 1 | 5 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | Benchmarking, Diversity | Code Available | 1 | 5 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Sep 27, 2024 | AutoML, Benchmarking | Code Available | 1 | 5 |
| GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Mar 23, 2025 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| Benchmarking saliency methods for chest X-ray interpretation | Oct 10, 2022 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| 3D Common Corruptions and Data Augmentation | Mar 2, 2022 | Benchmarking, Data Augmentation | Code Available | 1 | 5 |
| GenISP: Neural ISP for Low-Light Machine Cognition | May 7, 2022 | Benchmarking, Image Restoration | Code Available | 1 | 5 |
| AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Apr 9, 2024 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Dec 9, 2022 | Benchmarking, Classification | Code Available | 1 | 5 |
| Geoclidean: Few-Shot Generalization in Euclidean Geometry | Nov 30, 2022 | Benchmarking | Code Available | 1 | 5 |
| Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Jul 15, 2020 | Articles, Benchmarking | Code Available | 1 | 5 |
| Are we really making much progress? Revisiting, benchmarking, and refining heterogeneous graph neural networks | Dec 30, 2021 | Benchmarking, Heterogeneous Node Classification | Code Available | 1 | 5 |
| From Claims to Evidence: A Unified Framework and Critical Analysis of CNN vs. Transformer vs. Mamba in Medical Image Segmentation | Mar 3, 2025 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| Benchmarking Robustness of Text-Image Composed Retrieval | Nov 24, 2023 | Attribute, Benchmarking | Code Available | 1 | 5 |
| AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios | May 22, 2025 | Benchmarking, Instruction Following | Code Available | 1 | 5 |
| Benchmarking Robustness to Adversarial Image Obfuscations | Jan 30, 2023 | Benchmarking | Code Available | 1 | 5 |
| GENEVA: Benchmarking Generalizability for Event Argument Extraction with Hundreds of Event Types and Argument Roles | May 25, 2022 | Benchmarking, Event Argument Extraction | Code Available | 1 | 5 |
| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Nov 29, 2023 | Benchmarking | Code Available | 1 | 5 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | Anatomy, Benchmarking | Code Available | 1 | 5 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Apr 3, 2025 | Benchmarking, Memorization | Code Available | 1 | 5 |
| Benchmarking Robustness of Machine Reading Comprehension Models | Apr 29, 2020 | Benchmarking, Machine Reading Comprehension | Code Available | 1 | 5 |
| 3D AffordanceNet: A Benchmark for Visual Object Affordance Understanding | Mar 30, 2021 | Affordance Detection, Benchmarking | Code Available | 1 | 5 |
| Generative CKM Construction using Partially Observed Data with Diffusion Model | Dec 19, 2024 | Benchmarking | Code Available | 1 | 5 |
| Generative Wind Power Curve Modeling Via Machine Vision: A Self-learning Deep Convolutional Network Based Method | Aug 19, 2021 | Benchmarking, Synthetic Data Generation | Code Available | 1 | 5 |
| GenFace: A Large-Scale Fine-Grained Face Forgery Benchmark and Cross Appearance-Edge Learning | Feb 3, 2024 | Benchmarking, DeepFake Detection | Code Available | 1 | 5 |
| German's Next Language Model | Oct 21, 2020 | Benchmarking, Document Classification | Code Available | 1 | 5 |
| Benchmarking Robustness of 3D Object Detection to Common Corruptions | Jan 1, 2023 | 3D Object Detection, Autonomous Driving | Code Available | 1 | 5 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 | 5 |
| Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study | Feb 26, 2025 | Benchmarking, Blood pressure estimation | Code Available | 1 | 5 |
| A Review and Efficient Implementation of Scene Graph Generation Metrics | Apr 15, 2024 | Benchmarking, Graph Generation | Code Available | 1 | 5 |
| GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Jun 1, 2024 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Nov 22, 2017 | Benchmarking, feature selection | Code Available | 1 | 5 |
| 2.5D Visual Relationship Detection | Apr 26, 2021 | Benchmarking, Depth Estimation | Code Available | 1 | 5 |
| General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design | Jun 24, 2024 | Benchmarking, Drug Design | Code Available | 1 | 5 |
| Generating a Doppelganger Graph: Resembling but Distinct | Jan 23, 2021 | Benchmarking, Graph Representation Learning | Code Available | 1 | 5 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Oct 12, 2023 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | Benchmarking, Management | Code Available | 1 | 5 |
| GEMv2: Multilingual NLG Benchmarking in a Single Line of Code | Jun 22, 2022 | Benchmarking, Text Generation | Code Available | 1 | 5 |
| GAMA: a General Automated Machine learning Assistant | Jul 9, 2020 | AutoML, Benchmarking | Code Available | 1 | 5 |
| GastroVision: A Multi-class Endoscopy Image Dataset for Computer Aided Gastrointestinal Disease Detection | Jul 16, 2023 | Benchmarking | Code Available | 1 | 5 |
| G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks | Sep 29, 2023 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Quantized Neural Networks on FPGAs with FINN | Feb 2, 2021 | Benchmarking, Quantization | Code Available | 1 | 5 |
| GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection | Jun 21, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 | 5 |
| GCondenser: Benchmarking Graph Condensation | May 23, 2024 | Benchmarking, Graph Representation Learning | Code Available | 1 | 5 |
| Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Nov 22, 2021 | Benchmarking | Code Available | 1 | 5 |
| FTNet: Feature Transverse Network for Thermal Image Semantic Segmentation | Oct 26, 2021 | Benchmarking, Scene Segmentation | Code Available | 1 | 5 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Jun 5, 2023 | Benchmarking, Multiple-choice | Code Available | 1 | 5 |
| Benchmarking Large Multimodal Models against Common Corruptions | Jan 22, 2024 | Benchmarking, Image to text | Code Available | 1 | 5 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | Benchmarking, Classification | Code Available | 1 | 5 |
| FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow | May 23, 2025 | Benchmarking, Code Generation | Code Available | 1 | 5 |