| AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents | May 23, 2024 | Benchmarking | Code Available | 4 |
| Aequitas Flow: Streamlining Fair ML Experimentation | May 9, 2024 | Benchmarking, Fairness | Code Available | 4 |
| LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit | May 9, 2024 | Benchmarking, Computational Efficiency | Code Available | 4 |
| Benchmarking Retrieval-Augmented Generation for Medicine | Feb 20, 2024 | Benchmarking, Information Retrieval | Code Available | 4 |
| I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench | Jan 31, 2024 | Benchmarking, Multiple-choice | Code Available | 4 |
| Pearl: A Production-ready Reinforcement Learning Agent | Dec 6, 2023 | Benchmarking, reinforcement-learning | Code Available | 4 |
| Benchmarking Neural Network Training Algorithms | Jun 12, 2023 | Benchmarking | Code Available | 4 |
| OpenAGI: When LLM Meets Domain Experts | Apr 10, 2023 | Benchmarking, Natural Language Queries | Code Available | 4 |
| Vision-Language Models for Vision Tasks: A Survey | Apr 3, 2023 | Benchmarking, Knowledge Distillation | Code Available | 4 |
| MTEB: Massive Text Embedding Benchmark | Oct 13, 2022 | Benchmarking, Information Retrieval | Code Available | 4 |