| TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine | Jun 3, 2024 | Benchmarking, Question Answering | Code Available | 2 |
| Scaffold Splits Overestimate Virtual Screening Performance | Jun 2, 2024 | Benchmarking, Clustering | Unverified | 0 |
| WebSuite: Systematically Evaluating Why Web Agents Fail | Jun 1, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Jun 1, 2024 | Benchmarking | Code Available | 1 |
| On the project risk baseline: integrating aleatory uncertainty into project scheduling | May 31, 2024 | Benchmarking, Scheduling | Unverified | 0 |
| LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | May 30, 2024 | Benchmarking | Code Available | 1 |
| SECURE: Benchmarking Large Language Models for Cybersecurity | May 30, 2024 | Benchmarking | Code Available | 1 |
| Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | May 30, 2024 | All, Benchmarking | Unverified | 0 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| CoSy: Evaluating Textual Explanations of Neurons | May 30, 2024 | Benchmarking | Unverified | 0 |