| A Survey of Pathology Foundation Model: Progress and Future Directions | Apr 5, 2025 | Benchmarking, Multiple Instance Learning | Code Available | 1 |
| Generative Evaluation of Complex Reasoning in Large Language Models | Apr 3, 2025 | Benchmarking, Memorization | Code Available | 1 |
| BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Apr 2, 2025 | 3D Reconstruction, Benchmarking | Code Available | 1 |
| SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers | Mar 31, 2025 | Benchmarking | Code Available | 1 |
| EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos | Mar 28, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 |
| A Comprehensive Benchmark for RNA 3D Structure-Function Modeling | Mar 27, 2025 | Benchmarking, Deep Learning | Code Available | 1 |
| The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Mar 25, 2025 | Benchmarking, Scene Segmentation | Code Available | 1 |
| NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios | Mar 25, 2025 | Benchmarking, Offline RL | Code Available | 1 |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | Mar 25, 2025 | Benchmarking, Image Captioning | Code Available | 1 |