| CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs | Sep 9, 2024 | Benchmarking, knowledge editing | Unverified | 0 |
| A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Sep 9, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Insights from Benchmarking Frontier Language Models on Web App Code Generation | Sep 8, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm | Sep 6, 2024 | Benchmarking, regression | Unverified | 0 |
| Absolute Ranking: An Essential Normalization for Benchmarking Optimization Algorithms | Sep 6, 2024 | Bayesian Inference, Benchmarking | Unverified | 0 |
| Quantum Kernel Methods under Scrutiny: A Benchmarking Study | Sep 6, 2024 | Benchmarking, Quantum Machine Learning | Unverified | 0 |
| PlantSeg: A Large-Scale In-the-wild Dataset for Plant Disease Segmentation | Sep 6, 2024 | Benchmarking, image-classification | Code Available | 2 |
| Question-Answering Dense Video Events | Sep 6, 2024 | Benchmarking, Question Answering | Code Available | 0 |
| InfraLib: Enabling Reinforcement Learning and Decision-Making for Large-Scale Infrastructure Management | Sep 5, 2024 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Prediction Accuracy & Reliability: Classification and Object Localization under Distribution Shift | Sep 5, 2024 | Autonomous Driving, Benchmarking | Unverified | 0 |