| Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset | May 17, 2024 | 16k, Benchmarking | Code Available | 3 | 5 |
| Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | Oct 9, 2024 | Benchmarking, Decision Making | Code Available | 3 | 5 |
| Benchmarking LLMs via Uncertainty Quantification | Jan 23, 2024 | Benchmarking, Uncertainty Quantification | Code Available | 3 | 5 |
| Benchmarking Multimodal AutoML for Tabular Data with Text Fields | Nov 4, 2021 | AutoML, Benchmarking | Code Available | 3 | 5 |
| AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents | Oct 31, 2024 | Benchmarking | Code Available | 3 | 5 |
| Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning | Jan 26, 2023 | Benchmarking, Deep Reinforcement Learning | Code Available | 3 | 5 |
| A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs | Mar 14, 2022 | Benchmarking, Graph Embedding | Code Available | 3 | 5 |
| mlpack 3: a fast, flexible machine learning library | Jun 18, 2018 | Benchmarking, BIG-bench Machine Learning | Code Available | 3 | 5 |
| DrivAerNet++: A Large-Scale Multimodal Car Dataset with Computational Fluid Dynamics Simulations and Deep Learning Benchmarks | Jun 13, 2024 | Benchmarking | Code Available | 3 | 5 |
| A Survey on Performance Metrics for Object-Detection Algorithms | Jul 21, 2020 | Benchmarking, Object | Code Available | 3 | 5 |