| Working Memory Capacity of ChatGPT: An Empirical Study | Apr 30, 2023 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Mar 7, 2024 | Benchmarking, Multimodal Recommendation | Code Available | 1 | 5 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | Benchmarking, Evidence Selection | Code Available | 1 | 5 |
| Benchmarking Robustness of 3D Object Detection to Common Corruptions | Jan 1, 2023 | 3D Object Detection, Autonomous Driving | Code Available | 1 | 5 |
| A Comparison of Image Denoising Methods | Apr 18, 2023 | Benchmarking, Denoising | Code Available | 1 | 5 |
| Formalizing Multimedia Recommendation through Multimodal Deep Learning | Sep 11, 2023 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| Continual Learning with Foundation Models: An Empirical Study of Latent Replay | Apr 30, 2022 | Benchmarking, Continual Learning | Code Available | 1 | 5 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | Benchmarking, Management | Code Available | 1 | 5 |
| Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Nov 15, 2023 | Benchmarking, Instruction Following | Code Available | 1 | 5 |
| Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks | Apr 5, 2022 | Benchmarking | Code Available | 1 | 5 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 | 5 |
| Earnings-22: A Practical Benchmark for Accents in the Wild | Mar 29, 2022 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Code Available | 1 | 5 |
| FNBench: Benchmarking Robust Federated Learning against Noisy Labels | May 10, 2025 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Nov 6, 2023 | Benchmarking, Knowledge Base Question Answering | Code Available | 1 | 5 |
| Benchmarking Reinforcement Learning Techniques for Autonomous Navigation | Oct 10, 2022 | Autonomous Navigation, Benchmarking | Code Available | 1 | 5 |
| EBES: Easy Benchmarking for Event Sequences | Oct 4, 2024 | Benchmarking | Code Available | 1 | 5 |
| AI Accelerator Survey and Trends | Sep 18, 2021 | Benchmarking, Computational Efficiency | Code Available | 1 | 5 |
| FM-TS: Flow Matching for Time Series Generation | Nov 12, 2024 | Benchmarking, Imputation | Code Available | 1 | 5 |
| FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Sep 28, 2023 | Benchmarking, Image Retrieval | Code Available | 1 | 5 |
| EDFace-Celeb-1M: Benchmarking Face Hallucination with a Million-scale Dataset | Oct 11, 2021 | Benchmarking, Face Hallucination | Code Available | 1 | 5 |
| EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios | May 22, 2025 | Benchmarking | Code Available | 1 | 5 |
| Flames: Benchmarking Value Alignment of LLMs in Chinese | Nov 12, 2023 | Benchmarking, Fairness | Code Available | 1 | 5 |
| Benchmarking Quantized Neural Networks on FPGAs with FINN | Feb 2, 2021 | Benchmarking, Quantization | Code Available | 1 | 5 |
| Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining | Nov 22, 2017 | Benchmarking, feature selection | Code Available | 1 | 5 |
| AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses | Mar 3, 2025 | Benchmarking | Code Available | 1 | 5 |
| FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | May 27, 2025 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis | Mar 9, 2021 | Benchmarking, Classification | Code Available | 1 | 5 |
| Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Feb 10, 2025 | Benchmarking | Code Available | 1 | 5 |
| A skeletonization algorithm for gradient-based optimization | Sep 5, 2023 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| Benchmarking Visual Localization for Autonomous Navigation | Mar 24, 2022 | Autonomous Navigation, Benchmarking | Code Available | 1 | 5 |
| FiFAR: A Fraud Detection Dataset for Learning to Defer | Dec 20, 2023 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| A GPU-accelerated Large-scale Simulator for Transportation System Optimization Benchmarking | Jun 15, 2024 | Benchmarking, GPU | Code Available | 1 | 5 |
| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | Code Available | 1 | 5 |
| A Comparative Visual Analytics Framework for Evaluating Evolutionary Processes in Multi-objective Optimization | Aug 10, 2023 | Benchmarking, Decision Making | Code Available | 1 | 5 |
| FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding | Sep 27, 2021 | Benchmarking, Natural Language Understanding | Code Available | 1 | 5 |
| Benchmarking emergency department triage prediction models with machine learning and large public electronic health records | Nov 22, 2021 | Benchmarking | Code Available | 1 | 5 |
| Benchmarking Pathology Feature Extractors for Whole Slide Image Classification | Nov 20, 2023 | Benchmarking, image-classification | Code Available | 1 | 5 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | Benchmarking, Math | Code Available | 1 | 5 |
| FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Jun 15, 2023 | Benchmarking, Fairness | Code Available | 1 | 5 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Jul 25, 2024 | Benchmarking, Deep Learning | Code Available | 1 | 5 |
| A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging | Apr 26, 2020 | Benchmarking, Left Atrium Segmentation | Code Available | 1 | 5 |
| A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models | Mar 31, 2023 | Benchmarking, Causal Discovery | Code Available | 1 | 5 |
| A global analysis of metrics used for measuring performance in natural language processing | Apr 25, 2022 | Benchmarking, Machine Translation | Code Available | 1 | 5 |
| Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces | May 23, 2023 | Benchmarking | Code Available | 1 | 5 |
| FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Mar 7, 2025 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| Benchmarking: Past, Present and Future | Aug 1, 2021 | Benchmarking, Reading Comprehension | Code Available | 1 | 5 |
| FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks | Nov 22, 2021 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| A Comparative Attention Framework for Better Few-Shot Object Detection on Aerial Images | Oct 25, 2022 | Benchmarking, Few-Shot Object Detection | Code Available | 1 | 5 |
| ArtFID: Quantitative Evaluation of Neural Style Transfer | Jul 25, 2022 | Benchmarking, Meta-Learning | Code Available | 1 | 5 |