| Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations | Mar 21, 2024 | BenchmarkingMemorization | CodeCode Available | 1 |
| Benchmarking Adversarial Patch Against Aerial Detection | Oct 30, 2022 | Benchmarking | CodeCode Available | 1 |
| Benchmarking Data Science Agents | Feb 27, 2024 | BenchmarkingCode Generation | CodeCode Available | 1 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | BenchmarkingMath | CodeCode Available | 1 |
| CheX-GPT: Harnessing Large Language Models for Enhanced Chest X-ray Report Labeling | Jan 21, 2024 | Benchmarking | CodeCode Available | 1 |
| Benchmarking Adversarial Robustness on Image Classification | Jun 1, 2020 | Adversarial AttackAdversarial Robustness | CodeCode Available | 1 |
| CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods | Aug 2, 2022 | BenchmarkingCausal Discovery | CodeCode Available | 1 |
| FineSurE: Fine-grained Summarization Evaluation using LLMs | Jul 1, 2024 | BenchmarkingHallucination | CodeCode Available | 1 |
| CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization | Apr 6, 2025 | BenchmarkingCombinatorial Optimization | CodeCode Available | 1 |
| CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Jun 5, 2024 | Benchmarkingenergy management | CodeCode Available | 1 |