| Machine learning classification of non-Markovian noise disturbing quantum dynamics | Jan 8, 2021 | BenchmarkingBIG-bench Machine Learning | CodeCode Available | 0 | 5 |
| Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data | Jan 31, 2024 | BenchmarkingChange Detection | CodeCode Available | 0 | 5 |
| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Dec 20, 2024 | BenchmarkingDiagnostic | CodeCode Available | 0 | 5 |
| Distributed Non-Convex Optimization with Sublinear Speedup under Intermittent Client Availability | Feb 18, 2020 | BenchmarkingFederated Learning | CodeCode Available | 0 | 5 |
| Flexible Generation of Preference Data for Recommendation Analysis | Jul 23, 2024 | BenchmarkingRecommendation Systems | CodeCode Available | 0 | 5 |
| Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Mar 7, 2024 | Benchmarking | CodeCode Available | 0 | 5 |
| Dissecting Dissonance: Benchmarking Large Multimodal Models Against Self-Contradictory Instructions | Aug 2, 2024 | Benchmarkingmultimodal interaction | CodeCode Available | 0 | 5 |
| Benchmarking Large Language Models for Molecule Prediction Tasks | Mar 8, 2024 | BenchmarkingPrediction | CodeCode Available | 0 | 5 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | May 8, 2025 | Autonomous NavigationBenchmarking | CodeCode Available | 0 | 5 |
| Are Large Language Models Good at Utility Judgments? | Mar 28, 2024 | Answer GenerationBenchmarking | CodeCode Available | 0 | 5 |
| Benchmarking performance of object detection under image distortions in an uncontrolled environment | Oct 28, 2022 | BenchmarkingObject | CodeCode Available | 0 | 5 |
| DispaRisk: Auditing Fairness Through Usable Information | May 20, 2024 | BenchmarkingBias Detection | CodeCode Available | 0 | 5 |
| A Framework for Evaluating PM2.5 Forecasts from the Perspective of Individual Decision Making | Sep 9, 2024 | BenchmarkingDecision Making | CodeCode Available | 0 | 5 |
| Exploring Context Generalizability in Citywide Crowd Mobility Prediction: An Analytic Framework and Benchmark | Jun 30, 2021 | BenchmarkingPrediction | CodeCode Available | 0 | 5 |
| Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents | Jan 18, 2021 | Atari GamesBenchmarking | CodeCode Available | 0 | 5 |
| Generative Models for Fast Simulation of Cherenkov Detectors at the Electron-Ion Collider | Apr 26, 2025 | BenchmarkingGPU | CodeCode Available | 0 | 5 |
| GPT4Graph: Can Large Language Models Understand Graph Structured Data ? An Empirical Evaluation and Benchmarking | May 24, 2023 | BenchmarkingGraph Mining | CodeCode Available | 0 | 5 |
| Exploring Model-based Planning with Policy Networks | Jun 20, 2019 | Benchmarkingmodel | CodeCode Available | 0 | 5 |
| GenderBench: Evaluation Suite for Gender Biases in LLMs | May 17, 2025 | Benchmarking | CodeCode Available | 0 | 5 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | CodeCode Available | 0 | 5 |
| Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms | Jul 1, 2022 | BenchmarkingClassification | CodeCode Available | 0 | 5 |
| GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations | Jun 17, 2024 | BenchmarkingDataset Generation | CodeCode Available | 0 | 5 |
| A Recipe for CAC: Mosaic-based Generalized Loss for Improved Class-Agnostic Counting | Apr 15, 2024 | Benchmarking | CodeCode Available | 0 | 5 |
| Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters | Jun 16, 2024 | BenchmarkingInstance Segmentation | CodeCode Available | 0 | 5 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Oct 4, 2023 | BenchmarkingSegmentation | CodeCode Available | 0 | 5 |