| FedGraph: A Research Library and Benchmark for Federated Graph Learning | Oct 8, 2024 | Benchmarking, Federated Learning | Code Available | 2 |
| GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Nov 28, 2024 | Benchmarking, Object Counting | Code Available | 2 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age Estimation, Benchmarking | Code Available | 2 |
| Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark | Mar 10, 2025 | Autonomous Driving, Benchmarking | Code Available | 2 |
| GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Sep 24, 2024 | 3D geometry, 3DGS | Code Available | 2 |
| GV-Bench: Benchmarking Local Feature Matching for Geometric Verification of Long-term Loop Closure Detection | Jul 16, 2024 | Benchmarking, Loop Closure Detection | Code Available | 2 |
| Benchmarking Agentic Workflow Generation | Oct 10, 2024 | Benchmarking | Code Available | 2 |
| HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? | Jun 20, 2024 | Benchmarking, Point Processes | Code Available | 2 |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | Jul 1, 2024 | Benchmarking, Fairness | Code Available | 2 |
| HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance | Jul 9, 2024 | Benchmarking, Conditional Image Generation | Code Available | 2 |
| Extended Agriculture-Vision: An Extension of a Large Aerial Image Dataset for Agricultural Pattern Analysis | Mar 4, 2023 | Benchmarking, Contrastive Learning | Code Available | 2 |
| IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer | Jul 27, 2023 | Benchmarking, Image Manipulation | Code Available | 2 |
| InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks | Jan 10, 2024 | Benchmarking | Code Available | 2 |
| InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Mar 5, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Fast Vision Transformers with HiLo Attention | May 26, 2022 | Benchmarking, Efficient ViTs | Code Available | 2 |
| FaceScore: Benchmarking and Enhancing Face Quality in Human Generation | Jun 24, 2024 | Benchmarking, Denoising | Code Available | 2 |
| Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Aug 28, 2024 | Benchmarking | Code Available | 2 |
| InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback | Jun 26, 2023 | Benchmarking, Code Generation | Code Available | 2 |
| Benchmarking and Improving Detail Image Caption | May 29, 2024 | Benchmarking, Image Captioning | Code Available | 2 |
| Investigating Tradeoffs in Real-World Video Super-Resolution | Nov 24, 2021 | Benchmarking, Super-Resolution | Code Available | 2 |
| Event-Based Motion Magnification | Feb 19, 2024 | Benchmarking, Motion Detection | Code Available | 2 |
| Exponentially Faster Language Modelling | Nov 15, 2023 | Benchmarking, CPU | Code Available | 2 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Feb 12, 2025 | Benchmarking, Long-Context Understanding | Code Available | 2 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Dec 11, 2023 | Benchmarking, Emotional Intelligence | Code Available | 2 |
| LawBench: Benchmarking Legal Knowledge of Large Language Models | Sep 28, 2023 | Articles, Benchmarking | Code Available | 2 |
| Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Mar 3, 2021 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 |
| EV2Gym: A Flexible V2G Simulator for EV Smart Charging Research and Benchmarking | Apr 2, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 2 |
| Assessing SPARQL capabilities of Large Language Models | Sep 9, 2024 | Benchmarking, Knowledge Graphs | Code Available | 2 |
| LLaMEA-BO: A Large Language Model Evolutionary Algorithm for Automatically Generating Bayesian Optimization Algorithms | May 27, 2025 | Bayesian Optimization, Benchmarking | Code Available | 2 |
| Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework | Sep 17, 2024 | Benchmarking, Federated Learning | Code Available | 2 |
| LongEmbed: Extending Embedding Models for Long Context Retrieval | Apr 18, 2024 | 4k, 8k | Code Available | 2 |
| A Survey on Graph Neural Networks for Remaining Useful Life Prediction: Methodologies, Evaluation and Future Trends | Sep 29, 2024 | Benchmarking, graph construction | Code Available | 2 |
| A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark | Jan 1, 2024 | Age Estimation, Benchmarking | Code Available | 2 |
| LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | Feb 13, 2024 | Benchmarking, Model Selection | Code Available | 2 |
| EvalGIM: A Library for Evaluating Generative Image Models | Dec 13, 2024 | Benchmarking, Diversity | Code Available | 2 |
| EffiBench: Benchmarking the Efficiency of Automatically Generated Code | Feb 3, 2024 | Benchmarking, Code Completion | Code Available | 2 |
| MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Aug 17, 2022 | Benchmarking, Code Generation | Code Available | 2 |
| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Jun 26, 2024 | Benchmarking, Math | Code Available | 2 |
| A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Sep 21, 2024 | Benchmarking, Survey | Code Available | 2 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Apr 3, 2025 | Benchmarking, Logical Reasoning | Code Available | 2 |
| Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details | Feb 1, 2021 | Benchmarking, object-detection | Code Available | 2 |
| MIBench: A Comprehensive Framework for Benchmarking Model Inversion Attack and Defense | Oct 7, 2024 | Adversarial Robustness, Benchmarking | Code Available | 2 |
| FluidLab: A Differentiable Environment for Benchmarking Complex Fluid Manipulation | Mar 4, 2023 | Benchmarking, GPU | Code Available | 2 |
| DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | Jun 24, 2024 | Benchmarking, Image Generation | Code Available | 2 |
| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 |
| MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations | Jul 1, 2024 | Benchmarking, document understanding | Code Available | 2 |
| MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | May 20, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator | Feb 15, 2024 | Benchmarking, Diagnostic | Code Available | 2 |
| A Content-Driven Micro-Video Recommendation Dataset at Scale | Sep 27, 2023 | Benchmarking, Recommendation Systems | Code Available | 2 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Jul 15, 2025 | Benchmarking, Instruction Following | Code Available | 2 |