| FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging | Jun 6, 2025 | Benchmarking | Code Available | 1 |
| MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark | Jun 5, 2025 | Benchmarking | Code Available | 1 |
| macOSWorld: A Multilingual Interactive Benchmark for GUI Agents | Jun 4, 2025 | Benchmarking, Domain Adaptation | Code Available | 1 |
| Rethinking Machine Unlearning in Image Generation Models | Jun 3, 2025 | Benchmarking, Image Generation | Code Available | 1 |
| ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions | Jun 3, 2025 | Benchmarking, Diversity | Code Available | 1 |
| NetPress: Dynamically Generated LLM Benchmarks for Network Applications | Jun 3, 2025 | Benchmarking | Code Available | 1 |
| CODEMENV: Benchmarking Large Language Models on Code Migration | Jun 1, 2025 | Benchmarking | Code Available | 1 |
| AVROBUSTBENCH: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time | May 31, 2025 | Benchmarking, Test-time Adaptation | Code Available | 1 |
| ByzFL: Research Framework for Robust Federated Learning | May 30, 2025 | Benchmarking, Federated Learning | Code Available | 1 |
| Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation | May 30, 2025 | All, Benchmarking | Code Available | 1 |