| SafePath: Conformal Prediction for Safe LLM-Based Autonomous Navigation | May 14, 2025 | Autonomous Driving, Autonomous Navigation | Unverified | 0 |
| Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | May 13, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| HealthBench: Evaluating Large Language Models Towards Improved Human Health | May 13, 2025 | Instruction Following, Multiple-choice | Code Available | 7 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | May 13, 2025 | Form, Multiple-choice | Code Available | 0 |
| Assessing the Chemical Intelligence of Large Language Models | May 12, 2025 | Multiple-choice | Code Available | 1 |
| How well do LLMs reason over tabular data, really? | May 12, 2025 | Missing Values, Multiple-choice | Unverified | 0 |
| Tell Me Who Your Students Are: GPT Can Generate Valid Multiple-Choice Questions When Students' (Mis)Understanding Is Hinted | May 9, 2025 | Language Modeling, Language Modelling | Unverified | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | May 9, 2025 | Benchmarking, Form | Unverified | 0 |
| EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | May 7, 2025 | Multiple-choice, Question Answering | Code Available | 2 |