| Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | May 30, 2025 | BenchmarkingCryptanalysis | —Unverified | 0 | 0 |
| MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale | Jun 4, 2025 | BenchmarkingLanguage Modeling | —Unverified | 0 | 0 |
| Benchmarking Large Language Models for Cyberbullying Detection in Real-World YouTube Comments | May 25, 2025 | Benchmarking | —Unverified | 0 | 0 |
| EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition | Jun 5, 2025 | BenchmarkingEmotion Recognition | —Unverified | 0 | 0 |
| What can 5.17 billion regression fits tell us about artificial models of the human visual system? | Oct 12, 2021 | Benchmarking | —Unverified | 0 | 0 |
| MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models | Jun 24, 2024 | Benchmarking | —Unverified | 0 | 0 |
| Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques | Jun 6, 2025 | BenchmarkingModel Selection | —Unverified | 0 | 0 |
| MedBrowseComp: Benchmarking Medical Deep Research and Computer Use | May 20, 2025 | Benchmarking | —Unverified | 0 | 0 |
| Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking | Dec 2, 2024 | BenchmarkingDecision Making | —Unverified | 0 | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | BenchmarkingLanguage Model Evaluation | —Unverified | 0 | 0 |