| Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models | Jun 19, 2024 | Benchmarking, Open-Domain Question Answering | —Unverified | 0 | 0 |
| Benchmarking FedAvg and FedCurv for Image Classification Tasks | Mar 31, 2023 | Benchmarking, Classification | —Unverified | 0 | 0 |
| Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models | May 16, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA | Jan 29, 2024 | Benchmarking, Image Comprehension | —Unverified | 0 | 0 |
| Mukayese: Turkish NLP Strikes Back | Nov 16, 2021 | Benchmarking, Language Modeling | —Unverified | 0 | 0 |
| Benchmarking features from different radiomics toolkits / toolboxes using Image Biomarkers Standardization Initiative | Jun 23, 2020 | Benchmarking | —Unverified | 0 | 0 |
| Benchmarking Feature Extractors for Reinforcement Learning-Based Semiconductor Defect Localization | Nov 18, 2023 | Benchmarking, Deep Reinforcement Learning | —Unverified | 0 | 0 |
| Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 | May 22, 2025 | Benchmarking, Dialogue Generation | —Unverified | 0 | 0 |
| Multicalibration for Confidence Scoring in LLMs | Apr 6, 2024 | Benchmarking, Question Answering | —Unverified | 0 | 0 |
| Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking | Jul 21, 2016 | Action Recognition, Benchmarking | —Unverified | 0 | 0 |