| SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents | May 29, 2025 | Adversarial Attack, Large Language Model | Code Available | 1 |
| MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research | May 26, 2025 | scientific discovery | Code Available | 1 |
| PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration | May 21, 2025 | Large Language Model, scientific discovery | Code Available | 1 |
| Benchmarking AI scientists in omics data-driven biological research | May 13, 2025 | Benchmarking, Multiple-choice | Code Available | 1 |
| IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery | Apr 23, 2025 | scientific discovery | Code Available | 1 |
| The AI Cosmologist I: An Agentic System for Automated Data Analysis | Apr 4, 2025 | scientific discovery | Code Available | 1 |
| Offline Model-Based Optimization: Comprehensive Review | Mar 21, 2025 | model, Neural Architecture Search | Code Available | 1 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 |
| Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation | Feb 26, 2025 | Ingenuity, scientific discovery | Code Available | 1 |
| InductionBench: LLMs Fail in the Simplest Complexity Class | Feb 20, 2025 | scientific discovery | Code Available | 1 |