| Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT | Apr 3, 2024 | Benchmarking, General Knowledge | Code Available | 1 | 5 |
| Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | May 28, 2024 | Benchmarking, Feature Engineering | Code Available | 1 | 5 |
| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Mar 24, 2025 | Benchmarking, Humanitarian | Code Available | 1 | 5 |
| Benchmarking Object Detectors with COCO: A New Path Forward | Mar 27, 2024 | Benchmarking, Object | Code Available | 1 | 5 |
| KO codes: Inventing Nonlinear Encoding and Decoding for Reliable Wireless Communication via Deep-learning | Aug 29, 2021 | Benchmarking, Decoder | Code Available | 1 | 5 |
| KoLA: Carefully Benchmarking World Knowledge of Large Language Models | Jun 15, 2023 | Benchmarking, Hallucination | Code Available | 1 | 5 |
| Can Language Models Make Fun? A Case Study in Chinese Comical Crosstalk | Jul 2, 2022 | Benchmarking, Machine Translation | Code Available | 1 | 5 |
| Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Oct 4, 2023 | Benchmarking | Code Available | 1 | 5 |
| Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Jun 22, 2023 | Arithmetic Reasoning, Benchmarking | Code Available | 1 | 5 |
| LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning | Jun 16, 2023 | Active Learning, Benchmarking | Code Available | 1 | 5 |
| Working Memory Capacity of ChatGPT: An Empirical Study | Apr 30, 2023 | Benchmarking, Language Modeling | Code Available | 1 | 5 |
| Benchmarking Large Language Models for Automated Verilog RTL Code Generation | Dec 13, 2022 | Benchmarking, Code Generation | Code Available | 1 | 5 |
| Benchmarking Spatial Relationships in Text-to-Image Generation | Dec 20, 2022 | Benchmarking, Image Generation | Code Available | 1 | 5 |
| CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection | Mar 12, 2025 | Benchmarking, Code Classification | Code Available | 1 | 5 |
| A Reinforcement Learning Environment for Multi-Service UAV-enabled Wireless Systems | May 11, 2021 | Benchmarking, Edge-computing | Code Available | 1 | 5 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Oct 16, 2023 | Action Recognition, Benchmarking | Code Available | 1 | 5 |
| CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs | Apr 5, 2021 | Benchmarking, Knowledge Graphs | Code Available | 1 | 5 |
| CAVIAR: Co-simulation of 6G Communications, 3D Scenarios and AI for Digital Twins | Jan 6, 2024 | Autonomous Vehicles, Benchmarking | Code Available | 1 | 5 |
| Benchmarking Simulation-Based Inference | Jan 12, 2021 | Benchmarking | Code Available | 1 | 5 |
| GuacaMol: Benchmarking Models for De Novo Molecular Design | Nov 22, 2018 | Benchmarking, Drug Discovery | Code Available | 1 | 5 |
| Benchmarking Language Models for Code Syntax Understanding | Oct 26, 2022 | Benchmarking | Code Available | 1 | 5 |
| Large Language Models for Multi-Robot Systems: A Survey | Feb 6, 2025 | Action Generation, Benchmarking | Code Available | 1 | 5 |
| TextEE: Benchmark, Reevaluation, Reflections, and Future Challenges in Event Extraction | Nov 16, 2023 | Benchmarking, Event Extraction | Code Available | 1 | 5 |
| Chaos as an interpretable benchmark for forecasting and data-driven modelling | Oct 11, 2021 | Benchmarking, Symbolic Regression | Code Available | 1 | 5 |
| Graph Robustness Benchmark: Benchmarking the Adversarial Robustness of Graph Machine Learning | Nov 8, 2021 | Adversarial Robustness, Benchmarking | Code Available | 1 | 5 |