| GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Jun 11, 2025 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking saliency methods for chest X-ray interpretation | Oct 10, 2022 | BenchmarkingDecision Making | CodeCode Available | 1 | 5 |
| Global Wheat Head Detection (GWHD) dataset: a large and diverse dataset of high resolution RGB labelled images to develop and benchmark wheat head detection methods | Apr 25, 2020 | BenchmarkingHead Detection | CodeCode Available | 1 | 5 |
| Benchmarking Spatial Relationships in Text-to-Image Generation | Dec 20, 2022 | BenchmarkingImage Generation | CodeCode Available | 1 | 5 |
| GraphGallery: A Platform for Fast Benchmarking and Easy Development of Graph Neural Networks Based Intelligent Software | Feb 16, 2021 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Dec 9, 2022 | BenchmarkingClassification | CodeCode Available | 1 | 5 |
| A Review and Efficient Implementation of Scene Graph Generation Metrics | Apr 15, 2024 | BenchmarkingGraph Generation | CodeCode Available | 1 | 5 |
| Benchmarking Simulation-Based Inference | Jan 12, 2021 | Benchmarking | CodeCode Available | 1 | 5 |
| Graphs, Constraints, and Search for the Abstraction and Reasoning Corpus | Oct 18, 2022 | ARCBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Apr 15, 2024 | BenchmarkingBias Detection | CodeCode Available | 1 | 5 |
| Grad DFT: a software library for machine learning enhanced density functional theory | Sep 23, 2023 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Robustness of 3D Object Detection to Common Corruptions | Jan 1, 2023 | 3D Object DetectionAutonomous Driving | CodeCode Available | 1 | 5 |
| Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | May 7, 2025 | BenchmarkingHallucination | CodeCode Available | 1 | 5 |
| Benchmarking Spectral Graph Neural Networks: A Comprehensive Study on Effectiveness and Efficiency | Jun 14, 2024 | Benchmarking | CodeCode Available | 1 | 5 |
| GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Mar 23, 2025 | BenchmarkingHallucination | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Jun 5, 2023 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Sep 29, 2023 | BenchmarkingKnowledge Graph Completion | CodeCode Available | 1 | 5 |
| Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification | Jul 6, 2023 | BenchmarkingDomain Adaptation | CodeCode Available | 1 | 5 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | BenchmarkingClassification | CodeCode Available | 1 | 5 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Feb 19, 2025 | BenchmarkingDecision Making | CodeCode Available | 1 | 5 |
| Benchmarking the Generation of Fact Checking Explanations | Aug 29, 2023 | Abstractive Text SummarizationArticles | CodeCode Available | 1 | 5 |
| Geoclidean: Few-Shot Generalization in Euclidean Geometry | Nov 30, 2022 | Benchmarking | CodeCode Available | 1 | 5 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | AnatomyBenchmarking | CodeCode Available | 1 | 5 |
| Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate | Aug 12, 2021 | Benchmarking | CodeCode Available | 1 | 5 |
| Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs | Nov 29, 2023 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | BenchmarkingImage Captioning | CodeCode Available | 1 | 5 |
| Benchmarking Local Robustness of High-Accuracy Binary Neural Networks for Enhanced Traffic Sign Recognition | Sep 25, 2023 | Autonomous DrivingBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains | May 23, 2021 | Active LearningBayesian Optimisation | CodeCode Available | 1 | 5 |
| Benchmarking Low-Shot Robustness to Natural Distribution Shifts | Apr 21, 2023 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Mar 2, 2024 | AttributeBenchmarking | CodeCode Available | 1 | 5 |
| Are We There Yet? Evaluating State-of-the-Art Neural Network based Geoparsers Using EUPEG as a Benchmarking Platform | Jul 15, 2020 | ArticlesBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Jan 1, 2024 | BenchmarkingInstruction Following | CodeCode Available | 1 | 5 |
| AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents | Apr 9, 2024 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions | Mar 29, 2024 | Action DetectionBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Robustness of Machine Reading Comprehension Models | Apr 29, 2020 | BenchmarkingMachine Reading Comprehension | CodeCode Available | 1 | 5 |
| Benchmarking machine learning models on multi-centre eICU critical care dataset | Oct 2, 2019 | BenchmarkingBIG-bench Machine Learning | CodeCode Available | 1 | 5 |
| German's Next Language Model | Oct 21, 2020 | BenchmarkingDocument Classification | CodeCode Available | 1 | 5 |
| GraphArena: Benchmarking Large Language Models on Graph Computational Problems | Jun 29, 2024 | BenchmarkingHallucination | CodeCode Available | 1 | 5 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Jan 28, 2025 | Adversarial AttackBenchmarking | CodeCode Available | 1 | 5 |
| Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Nov 6, 2023 | BenchmarkingData Compression | CodeCode Available | 1 | 5 |
| Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data | Feb 27, 2024 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Meaning Representations in Neural Semantic Parsing | Nov 1, 2020 | BenchmarkingSemantic Parsing | CodeCode Available | 1 | 5 |
| ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning | Sep 27, 2024 | AutoMLBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Meta-embeddings: What Works and What Does Not | Nov 1, 2021 | BenchmarkingEmbeddings Evaluation | CodeCode Available | 1 | 5 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | BenchmarkingDiversity | CodeCode Available | 1 | 5 |
| Benchmarking Micro-action Recognition: Dataset, Methods, and Applications | Mar 8, 2024 | Action RecognitionBenchmarking | CodeCode Available | 1 | 5 |
| Generative Wind Power Curve Modeling Via Machine Vision: A Self-learning Deep Convolutional Network Based Method | Aug 19, 2021 | BenchmarkingSynthetic Data Generation | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models for News Summarization | Jan 31, 2023 | BenchmarkingNews Summarization | CodeCode Available | 1 | 5 |