| Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization | Jun 21, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis | Jun 21, 2024 | AI Agent, AutoML | Code Available | 2 |
| Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors | Jun 21, 2024 | Adversarial Defense, Adversarial Robustness | Unverified | 0 |
| Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video | Jun 21, 2024 | Benchmarking, Few-Shot Learning | Unverified | 0 |
| NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking | Jun 21, 2024 | Autonomous Driving, Benchmarking | Code Available | 7 |
| CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines | Jun 20, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary | Jun 20, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| QeMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules | Jun 20, 2024 | Benchmarking | Code Available | 0 |
| Selected Languages are All You Need for Cross-lingual Truthfulness Transfer | Jun 20, 2024 | All, Benchmarking | Code Available | 0 |
| Beyond Optimism: Exploration With Partially Observable Rewards | Jun 20, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |