| Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning | Dec 11, 2024 | AttributeBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Retrieval-Augmented Multimomal Generation for Document Question Answering | May 22, 2025 | BenchmarkingEvidence Selection | CodeCode Available | 1 | 5 |
| 2.5D Visual Relationship Detection | Apr 26, 2021 | BenchmarkingDepth Estimation | CodeCode Available | 1 | 5 |
| Benchmarking Robustness to Adversarial Image Obfuscations | Jan 30, 2023 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Robustness of Text-Image Composed Retrieval | Nov 24, 2023 | AttributeBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Robustness of Machine Reading Comprehension Models | Apr 29, 2020 | BenchmarkingMachine Reading Comprehension | CodeCode Available | 1 | 5 |
| GEMv2: Multilingual NLG Benchmarking in a Single Line of Code | Jun 22, 2022 | BenchmarkingText Generation | CodeCode Available | 1 | 5 |
| Benchmarking saliency methods for chest X-ray interpretation | Oct 10, 2022 | BenchmarkingDecision Making | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset | Jun 5, 2023 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification | Jun 20, 2024 | BenchmarkingClassification | CodeCode Available | 1 | 5 |
| Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift | Dec 15, 2022 | BenchmarkingImage Captioning | CodeCode Available | 1 | 5 |
| Benchmarking Self-Supervised Learning on Diverse Pathology Datasets | Dec 9, 2022 | BenchmarkingClassification | CodeCode Available | 1 | 5 |
| Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | May 7, 2025 | BenchmarkingHallucination | CodeCode Available | 1 | 5 |
| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Mar 2, 2024 | AttributeBenchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | Feb 28, 2024 | BenchmarkingMultiple-choice | CodeCode Available | 1 | 5 |
| GLGENN: A Novel Parameter-Light Equivariant Neural Networks Architecture Based on Clifford Geometric Algebras | Jun 11, 2025 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Simulation-Based Inference | Jan 12, 2021 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | May 28, 2024 | BenchmarkingFeature Engineering | CodeCode Available | 1 | 5 |
| OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling | Jul 13, 2024 | BenchmarkingMath | CodeCode Available | 1 | 5 |
| Benchmarking LLMs for Political Science: A United Nations Perspective | Feb 19, 2025 | BenchmarkingDecision Making | CodeCode Available | 1 | 5 |
| Benchmarking Large Language Models on Controllable Generation under Diversified Instructions | Jan 1, 2024 | BenchmarkingInstruction Following | CodeCode Available | 1 | 5 |
| Benchmarking Recommendation, Classification, and Tracing Based on Hugging Face Knowledge Graph | May 23, 2025 | BenchmarkingManagement | CodeCode Available | 1 | 5 |
| Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering | May 25, 2025 | AnatomyBenchmarking | CodeCode Available | 1 | 5 |
| Grad DFT: a software library for machine learning enhanced density functional theory | Sep 23, 2023 | Benchmarking | CodeCode Available | 1 | 5 |
| Benchmarking the Performance of Bayesian Optimization across Multiple Experimental Materials Science Domains | May 23, 2021 | Active LearningBayesian Optimisation | CodeCode Available | 1 | 5 |