| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Feb 12, 2024 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 2 |
| DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation | Jun 22, 2022 | Benchmarking, Recommendation Systems | Code Available | 2 |
| EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models | Dec 11, 2023 | Benchmarking, Emotional Intelligence | Code Available | 2 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 |
| CoqPilot, a plugin for LLM-based generation of proofs | Oct 25, 2024 | Benchmarking | Code Available | 2 |
| Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | Oct 21, 2024 | Benchmarking, Instruction Following | Code Available | 2 |
| GenRL: Multimodal-foundation world models for generalization in embodied agents | Jun 26, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 2 |
| Commit0: Library Generation from Scratch | Dec 2, 2024 | Benchmarking, Code Generation | Code Available | 2 |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Jul 3, 2024 | Benchmarking, Code Search | Code Available | 2 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 |
| Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | Jun 13, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| Neptune: The Long Orbit to Benchmarking Long Video Understanding | Dec 12, 2024 | Benchmarking, Multimodal Reasoning | Code Available | 2 |
| Craftium: An Extensible Framework for Creating Reinforcement Learning Environments | Jul 4, 2024 | Benchmarking, Minecraft | Code Available | 2 |
| ClimateLearn: Benchmarking Machine Learning for Weather and Climate Modeling | Jul 4, 2023 | Benchmarking, Weather Forecasting | Code Available | 2 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Jul 23, 2024 | Benchmarking, Continual Learning | Code Available | 2 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Oct 12, 2023 | Benchmarking, Code Generation | Code Available | 2 |
| Are large language models superhuman chemists? | Apr 1, 2024 | Benchmarking | Code Available | 2 |
| Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception | Jun 10, 2023 | 3D Object Detection, Benchmarking | Code Available | 2 |
| Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations | Jun 9, 2022 | Benchmarking, continuous-control | Code Available | 2 |
| CausalGym: Benchmarking causal interpretability methods on linguistic tasks | Feb 19, 2024 | Benchmarking, Interpretability Techniques for Deep Learning | Code Available | 2 |
| Building Normalizing Flows with Stochastic Interpolants | Sep 30, 2022 | Benchmarking, Density Estimation | Code Available | 2 |
| OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception | Mar 7, 2023 | Autonomous Driving, Benchmarking | Code Available | 2 |
| BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics | Jun 13, 2024 | Benchmarking | Code Available | 2 |
| Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework | Jun 23, 2020 | Benchmarking, GPU | Code Available | 2 |
| A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning | Sep 26, 2023 | Benchmarking, Multi-Objective Reinforcement Learning | Code Available | 2 |