| A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Voice | Dec 20, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| Arena-Rosnav 2.0: A Development and Benchmarking Platform for Robot Navigation in Highly Dynamic Environments | Feb 20, 2023 | Benchmarking, Robot Navigation | Code Available | 0 |
| On the Fragility of Active Learners for Text Classification | Mar 23, 2024 | Active Learning, Benchmarking | Code Available | 0 |
| Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation | Oct 29, 2021 | Benchmarking, Brain Tumor Segmentation | Code Available | 0 |
| Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Feb 8, 2024 | Benchmarking | Code Available | 0 |
| Benchmarking Large Language Models for Math Reasoning Tasks | Aug 20, 2024 | Benchmarking, In-Context Learning | Code Available | 0 |
| Benchmarking Large Language Models for Image Classification of Marine Mammals | Oct 22, 2024 | Benchmarking, image-classification | Code Available | 0 |
| On the Loss of Context-awareness in General Instruction Fine-tuning | Nov 5, 2024 | Benchmarking, Instruction Following | Code Available | 0 |
| HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation | May 16, 2025 | Benchmarking, Ethics | Code Available | 0 |
| SNaC: Coherence Error Detection for Narrative Summarization | May 19, 2022 | Benchmarking, Coherence Evaluation | Code Available | 0 |
| SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services | May 29, 2025 | Benchmarking, Information Retrieval | Code Available | 0 |
| Using Motif Transitions for Temporal Graph Generation | Jun 19, 2023 | Benchmarking, Graph Generation | Code Available | 0 |
| Accurate Peak Detection in Multimodal Optimization via Approximated Landscape Learning | Mar 23, 2025 | Benchmarking | Code Available | 0 |
| Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias | Jul 3, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams | Jun 17, 2024 | All, Benchmarking | Code Available | 0 |
| Word Embeddings for the Construction Domain | Oct 28, 2016 | Benchmarking, General Classification | Code Available | 0 |
| What Actions are Needed for Understanding Human Actions in Videos? | Aug 9, 2017 | Benchmarking | Code Available | 0 |
| ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness | Jun 1, 2025 | Benchmarking, Management | Code Available | 0 |
| On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers | Mar 16, 2022 | Benchmarking | Code Available | 0 |
| On the Use of ArXiv as a Dataset | Apr 30, 2019 | Articles, Author Attribution | Code Available | 0 |
| On the use of automatically generated synthetic image datasets for benchmarking face recognition | Jun 8, 2021 | Benchmarking, Face Recognition | Code Available | 0 |
| Benchmarking Large Language Models for Molecule Prediction Tasks | Mar 8, 2024 | Benchmarking, Prediction | Code Available | 0 |
| Accel-NASBench: Sustainable Benchmarking for Accelerator-Aware NAS | Apr 9, 2024 | Benchmarking, Neural Architecture Search | Code Available | 0 |
| SoftPQ: Robust Instance Segmentation Evaluation via Soft Matching and Tunable Thresholds | May 17, 2025 | Benchmarking, Binary Classification | Code Available | 0 |
| On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition | Jun 6, 2021 | Benchmarking, Memorization | Code Available | 0 |