| LLM-Based Multi-Agent Systems are Scalable Graph Generative Models | Oct 13, 2024 | Benchmarking, Graph Generation | Code Available | 2 | 5 |
| DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering | Jul 15, 2025 | Benchmarking, Instruction Following | Code Available | 2 | 5 |
| State-specific protein-ligand complex structure prediction with a multi-scale deep generative model | Sep 30, 2022 | Benchmarking, Blind Docking | Code Available | 2 | 5 |
| Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing | Apr 3, 2025 | Benchmarking, Logical Reasoning | Code Available | 2 | 5 |
| A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Sep 21, 2024 | Benchmarking, Survey | Code Available | 2 | 5 |
| FedGraph: A Research Library and Benchmark for Federated Graph Learning | Oct 8, 2024 | Benchmarking, Federated Learning | Code Available | 2 | 5 |
| Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer | Mar 21, 2025 | Benchmarking, Video Generation | Code Available | 2 | 5 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Jun 22, 2022 | Benchmarking, Recommendation Systems | Code Available | 2 | 5 |
| Benchmarking Deep Reinforcement Learning for Continuous Control | Apr 22, 2016 | Action Triplet Recognition, Atari Games | Code Available | 2 | 5 |
| Datasets and Benchmarks for Offline Safe Reinforcement Learning | Jun 15, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 | 5 |
| Deep Visual Geo-localization Benchmark | Apr 7, 2022 | Benchmarking, Data Augmentation | Code Available | 2 | 5 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Jul 4, 2024 | Benchmarking, Minecraft | Code Available | 2 | 5 |
| A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Sep 26, 2023 | Benchmarking, Multi-Objective Reinforcement Learning | Code Available | 2 | 5 |
| CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions | May 24, 2025 | Benchmarking | Code Available | 2 | 5 |
| Benchmarking Robustness of 3D Point Cloud Recognition Against Common Corruptions | Jan 28, 2022 | 3D Point Cloud Classification, 3D Point Cloud Data Augmentation | Code Available | 2 | 5 |
| OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs | May 9, 2024 | Benchmarking, Fact Checking | Code Available | 2 | 5 |
| Open Universal Arabic ASR Leaderboard | Dec 18, 2024 | Benchmarking | Code Available | 2 | 5 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 | 5 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 | 5 |
| OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | Jan 9, 2025 | Benchmarking, Video Understanding | Code Available | 2 | 5 |
| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Feb 12, 2024 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 2 | 5 |
| Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint) | Jan 14, 2023 | Benchmarking | Code Available | 2 | 5 |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Jul 3, 2024 | Benchmarking, Code Search | Code Available | 2 | 5 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Jul 23, 2024 | Benchmarking, Continual Learning | Code Available | 2 | 5 |
| Authorship Obfuscation in Multilingual Machine-Generated Text Detection | Jan 15, 2024 | Adversarial Robustness, Benchmarking | Code Available | 2 | 5 |
| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | Benchmarking, Weather Forecasting | Code Available | 2 | 5 |
| Benchmarking Synthetic Tabular Data: A Multi-Dimensional Evaluation Framework | Apr 2, 2025 | Benchmarking, Synthetic Data Generation | Code Available | 2 | 5 |
| PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Jun 15, 2023 | Benchmarking | Code Available | 2 | 5 |
| PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning | Jun 24, 2025 | Benchmarking, Drug Discovery | Code Available | 2 | 5 |
| PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models | May 15, 2024 | Benchmarking | Code Available | 2 | 5 |
| Commit0: Library Generation from Scratch | Dec 2, 2024 | Benchmarking, Code Generation | Code Available | 2 | 5 |
| ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Oct 11, 2023 | Benchmarking, Position | Code Available | 2 | 5 |
| Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Jul 4, 2024 | Benchmarking, Instruction Following | Code Available | 2 | 5 |
| Class-incremental Learning for Time Series: Benchmark and Evaluation | Feb 19, 2024 | Activity Recognition, Benchmarking | Code Available | 2 | 5 |
| Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Jun 9, 2022 | Benchmarking, continuous-control | Code Available | 2 | 5 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 | 5 |
| Benchmarking the Robustness of LiDAR Semantic Segmentation Models | Jan 3, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 | 5 |
| Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Jul 17, 2024 | Benchmarking, Red Teaming | Code Available | 2 | 5 |
| Revealing data leakage in protein interaction benchmarks | Apr 16, 2024 | Benchmarking | Code Available | 2 | 5 |
| FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Feb 20, 2025 | Age Estimation, Benchmarking | Code Available | 2 | 5 |
| Learning to Fly -- a Gym Environment with PyBullet Physics for Reinforcement Learning of Multi-agent Quadcopter Control | Mar 3, 2021 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 2 | 5 |
| RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning | Apr 9, 2023 | Benchmarking, Deep Reinforcement Learning | Code Available | 2 | 5 |
| REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Mar 4, 2024 | Benchmarking | Code Available | 2 | 5 |
| Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph | Jun 21, 2024 | Benchmarking, Text Generation | Code Available | 2 | 5 |
| BARS: Towards Open Benchmarking for Recommender Systems | May 19, 2022 | Benchmarking, Click-Through Rate Prediction | Code Available | 2 | 5 |
| Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach | Aug 31, 2019 | Articles, Benchmarking | Code Available | 2 | 5 |
| COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning | Jan 15, 2021 | Benchmarking, Misinformation | Code Available | 1 | 5 |
| Category-wise Fine-Tuning: Resisting Incorrect Pseudo-Labels in Multi-Label Image Classification with Partial Labels | Jan 30, 2024 | Benchmarking, image-classification | Code Available | 1 | 5 |
| RADAR: Benchmarking Language Models on Imperfect Tabular Data | Jun 9, 2025 | Benchmarking, Missing Values | Code Available | 1 | 5 |
| Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics | Jun 8, 2021 | Age And Gender Classification, Benchmarking | Code Available | 1 | 5 |