| MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data | Jun 26, 2024 | Benchmarking, Math | Code Available | 2 | 5 |
| CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation | Oct 30, 2024 | Benchmarking, Passage Retrieval | Code Available | 2 | 5 |
| Benchmarking Benchmark Leakage in Large Language Models | Apr 29, 2024 | Benchmarking, Mathematical Reasoning | Code Available | 2 | 5 |
| Benchmarking Complex Instruction-Following with Multiple Constraints Composition | Jul 4, 2024 | Benchmarking, Instruction Following | Code Available | 2 | 5 |
| Customizable Perturbation Synthesis for Robust SLAM Benchmarking | Feb 12, 2024 | Benchmarking, Simultaneous Localization and Mapping | Code Available | 2 | 5 |
| MINERVA: Evaluating Complex Video Reasoning | May 1, 2025 | Benchmarking, Temporal Localization | Code Available | 2 | 5 |
| EasyTPP: Towards Open Benchmarking Temporal Point Processes | Jul 16, 2023 | Benchmarking, Point Processes | Code Available | 2 | 5 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Jul 23, 2024 | Benchmarking, Continual Learning | Code Available | 2 | 5 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Jun 23, 2023 | Benchmarking, Language Modeling | Code Available | 2 | 5 |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | Jul 3, 2024 | Benchmarking, Code Search | Code Available | 2 | 5 |