| Benchmarking Biopharmaceuticals Retrieval-Augmented Generation Evaluation | Apr 15, 2025 | Benchmarking, Question Answering | Unverified | 0 |
| REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites | Apr 15, 2025 | Autonomous Web Navigation, Benchmarking | Code Available | 3 |
| GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR | Apr 15, 2025 | Benchmarking | Unverified | 0 |
| HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation | Apr 15, 2025 | Benchmarking, scientific discovery | Code Available | 2 |
| E2E Parking Dataset: An Open Benchmark for End-to-End Autonomous Parking | Apr 15, 2025 | Benchmarking, Position | Unverified | 0 |
| Mamba-Based Ensemble Learning for White Blood Cell Classification | Apr 15, 2025 | Benchmarking, Classification | Code Available | 0 |
| Benchmarking Next-Generation Reasoning-Focused Large Language Models in Ophthalmology: A Head-to-Head Evaluation on 5,888 Items | Apr 15, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| Benchmarking Vision Language Models on German Factual Data | Apr 15, 2025 | Benchmarking | Unverified | 0 |
| FHBench: Towards Efficient and Personalized Federated Learning for Multimodal Healthcare | Apr 15, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives | Apr 15, 2025 | Benchmarking | Unverified | 0 |