| Building reliable sim driving agents by scaling self-play | Feb 20, 2025 | Autonomous Vehicles, Benchmarking | Code Available | 4 | 5 |
| Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation | Feb 23, 2025 | Benchmarking | Code Available | 4 | 5 |
| BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions | Jun 22, 2024 | Benchmarking, Code Generation | Code Available | 4 | 5 |
| OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Jun 14, 2025 | Benchmarking | Code Available | 4 | 5 |
| OpenAGI: When LLM Meets Domain Experts | Apr 10, 2023 | Benchmarking, Natural Language Queries | Code Available | 4 | 5 |
| Pearl: A Production-ready Reinforcement Learning Agent | Dec 6, 2023 | Benchmarking, reinforcement-learning | Code Available | 4 | 5 |
| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | Benchmarking, Logical Reasoning | Code Available | 4 | 5 |
| Benchmarking Neural Network Training Algorithms | Jun 12, 2023 | Benchmarking | Code Available | 4 | 5 |
| MTEB: Massive Text Embedding Benchmark | Oct 13, 2022 | Benchmarking, Information Retrieval | Code Available | 4 | 5 |
| Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation | Feb 4, 2025 | Benchmarking, Information Retrieval | Code Available | 4 | 5 |
| Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets | Mar 9, 2022 | Benchmarking, Graph Regression | Code Available | 4 | 5 |
| LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit | May 9, 2024 | Benchmarking, Computational Efficiency | Code Available | 4 | 5 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Feb 7, 2025 | Benchmarking | Code Available | 4 | 5 |
| Aequitas Flow: Streamlining Fair ML Experimentation | May 9, 2024 | Benchmarking, Fairness | Code Available | 4 | 5 |
| Benchmarking Retrieval-Augmented Generation for Medicine | Feb 20, 2024 | Benchmarking, Information Retrieval | Code Available | 4 | 5 |
| Accelerating Data Processing and Benchmarking of AI Models for Pathology | Feb 10, 2025 | Benchmarking | Code Available | 4 | 5 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 | 5 |
| Benchopt: Reproducible, efficient and collaborative optimization benchmarks | Jun 27, 2022 | Benchmarking, image-classification | Code Available | 4 | 5 |
| AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents | May 23, 2024 | Benchmarking | Code Available | 4 | 5 |
| MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from Microwatts to Megawatts for Sustainable AI | Oct 15, 2024 | Benchmarking | Code Available | 4 | 5 |
| Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | Dec 23, 2024 | 3D Shape Modeling, Benchmarking | Code Available | 4 | 5 |
| Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving | Jun 6, 2024 | Autonomous Driving, Bench2Drive | Code Available | 4 | 5 |
| A deep learning framework for efficient pathology image analysis | Feb 18, 2025 | Benchmarking, Deep Learning | Code Available | 4 | 5 |
| Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments | Jun 24, 2024 | Benchmarking | Code Available | 4 | 5 |
| Molecular-driven Foundation Model for Oncologic Pathology | Jan 28, 2025 | Benchmarking, Diagnostic | Code Available | 4 | 5 |