| Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency | Apr 24, 2025 | Benchmarking, Math | Code Available | 1 | 5 |
| Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning | Nov 29, 2024 | Benchmarking, DeepFake Detection | Code Available | 1 | 5 |
| Large Scale MRI Collection and Segmentation of Cirrhotic Liver | Oct 6, 2024 | Benchmarking, Diagnostic | Code Available | 1 | 5 |
| HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns | Jan 28, 2025 | Adversarial Attack, Benchmarking | Code Available | 1 | 5 |
| Light Field Salient Object Detection: A Review and Benchmark | Oct 10, 2020 | Benchmarking, Object | Code Available | 1 | 5 |
| Benchmarking: Past, Present and Future | Aug 1, 2021 | Benchmarking, Reading Comprehension | Code Available | 1 | 5 |
| GUI-Robust: A Comprehensive Dataset for Testing GUI Agent Robustness in Real-World Anomalies | Jun 17, 2025 | Benchmarking | Code Available | 1 | 5 |
| LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Oct 31, 2024 | Benchmarking, Prediction | Code Available | 1 | 5 |
| GuacaMol: Benchmarking Models for De Novo Molecular Design | Nov 22, 2018 | Benchmarking, Drug Discovery | Code Available | 1 | 5 |
| Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Dec 9, 2022 | Benchmarking, Classification | Code Available | 1 | 5 |