| Working Memory Capacity of ChatGPT: An Empirical Study | Apr 30, 2023 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica | Sep 6, 2021 | Benchmarking | Code Available | 1 | 5 |
| FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks | Nov 22, 2021 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 | 5 |
| featsel: A framework for benchmarking of feature selection algorithms and cost functions | Jul 19, 2017 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Sep 29, 2023 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| RADAR: Benchmarking Language Models on Imperfect Tabular Data | Jun 9, 2025 | Benchmarking, Missing Values | Code Available | 1 | 5 |
| Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models? | Aug 14, 2023 | Benchmarking, Drug Design | Code Available | 1 | 5 |
| Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Nov 15, 2023 | Benchmarking, Instruction Following | Code Available | 1 | 5 |
| DomainLab: A modular Python package for domain generalization in deep learning | Mar 21, 2024 | Benchmarking, Domain Generalization | Code Available | 1 | 5 |
| Federated Learning Under Intermittent Client Availability and Time-Varying Communication Constraints | May 13, 2022 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Apr 29, 2024 | Answer Generation, Benchmarking | Code Available | 1 | 5 |
| Benchmarking: Past, Present and Future | Aug 1, 2021 | Benchmarking, Reading Comprehension | Code Available | 1 | 5 |
| Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Nov 6, 2023 | Benchmarking, Knowledge Base Question Answering | Code Available | 1 | 5 |
| Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension | Mar 26, 2022 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms | Aug 25, 2017 | Benchmarking, BIG-bench Machine Learning | Code Available | 1 | 5 |
| A Comparison of Image Denoising Methods | Apr 18, 2023 | Benchmarking, Denoising | Code Available | 1 | 5 |
| Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | May 30, 2025 | All, Benchmarking | Code Available | 1 | 5 |
| Fast hyperboloid decision tree algorithms | Oct 20, 2023 | Benchmarking, Riemannian optimization | Code Available | 1 | 5 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Offline Reinforcement Learning on Real-Robot Hardware | Jul 28, 2023 | Benchmarking, reinforcement-learning | Code Available | 1 | 5 |
| AI Accelerator Survey and Trends | Sep 18, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| EXPObench: Benchmarking Surrogate-based Optimisation Algorithms on Expensive Black-box Functions | Jun 8, 2021 | Bayesian Optimisation, Benchmarking | Code Available | 1 | 5 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Object Detectors with COCO: A New Path Forward | Mar 27, 2024 | Benchmarking, Object | Code Available | 1 | 5 |