| FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs | Mar 27, 2025 | Attribute, Benchmarking | Code Available | 1 | 5 |
| Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System | Sep 23, 2021 | Benchmarking, Response Generation | Code Available | 1 | 5 |
| Application-Oriented Benchmarking of Quantum Generative Learning Using QUARK | Aug 8, 2023 | Benchmarking, GPU | Code Available | 1 | 5 |
| Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Nov 6, 2023 | Benchmarking, Knowledge Base Question Answering | Code Available | 1 | 5 |
| Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension | Mar 26, 2022 | Benchmarking, Question Answering | Code Available | 1 | 5 |
| FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Sep 29, 2023 | Benchmarking, Federated Learning | Code Available | 1 | 5 |
| Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" | Dec 2, 2024 | Benchmarking, Representation Learning | Code Available | 1 | 5 |
| Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI Evaluation Methods into an Interactive and Multi-dimensional Benchmark | Jun 8, 2022 | Benchmarking, Explainable Artificial Intelligence (XAI) | Code Available | 1 | 5 |
| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Mar 24, 2025 | Benchmarking, Humanitarian | Code Available | 1 | 5 |
| AI Agents That Matter | Jul 1, 2024 | Benchmarking | Code Available | 1 | 5 |