| Autonomous Microscopy Experiments through Large Language Model Agents | Dec 18, 2024 | Benchmarking, Experimental Design | Code Available | 1 |
| CriticBench: Benchmarking LLMs for Critique-Correct Reasoning | Feb 22, 2024 | Benchmarking | Code Available | 1 |
| Autonomous Reinforcement Learning: Formalism and Benchmarking | Dec 17, 2021 | Benchmarking, reinforcement-learning | Code Available | 1 |
| CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms | Aug 2, 2021 | Benchmarking, counterfactual | Code Available | 1 |
| COVID-19 event extraction from Twitter via extractive question answering with continuous prompts | Mar 19, 2023 | Benchmarking, Event Extraction | Code Available | 1 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Nov 5, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | May 8, 2025 | Benchmarking, Prompt Engineering | Code Available | 1 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Nov 4, 2024 | Action Generation, Benchmarking | Code Available | 1 |
| Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Oct 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Counseling | Jun 10, 2025 | Benchmarking | Code Available | 1 |