| Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos | Jul 12, 2024 | Benchmarking, Pupil Dilation | Code Available | 1 |
| Benchmarking Language Model Creativity: A Case Study on Code Generation | Jul 12, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Natural language is not enough: Benchmarking multi-modal generative AI for Verilog generation | Jul 11, 2024 | Benchmarking | Code Available | 1 |
| PredBench: Benchmarking Spatio-Temporal Prediction across Diverse Disciplines | Jul 11, 2024 | Benchmarking, Prediction | Code Available | 1 |
| Benchmarking Embedding Aggregation Methods in Computational Pathology: A Clinical Data Perspective | Jul 10, 2024 | Benchmarking, Diagnostic | Code Available | 1 |
| Training on the Test Task Confounds Evaluation and Emergence | Jul 10, 2024 | Benchmarking, Language Modelling | Code Available | 1 |
| OpenCIL: Benchmarking Out-of-Distribution Detection in Class-Incremental Learning | Jul 8, 2024 | Benchmarking, class-incremental learning | Code Available | 1 |
| CodeUpdateArena: Benchmarking Knowledge Editing on API Updates | Jul 8, 2024 | Benchmarking, knowledge editing | Code Available | 1 |
| Replication in Visual Diffusion Models: A Survey and Outlook | Jul 7, 2024 | Benchmarking, Survey | Code Available | 1 |
| Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters | Jul 5, 2024 | Benchmarking, valid | Code Available | 1 |