| A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior | May 13, 2025 | Benchmarking, Seismic Interpretation | Unverified | 0 |
| Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities | May 13, 2025 | automatic-speech-translation, Benchmarking | Unverified | 0 |
| ExEBench: Benchmarking Foundation Models on Extreme Earth Events | May 13, 2025 | Benchmarking, Management | Code Available | 0 |
| Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030 | May 12, 2025 | Benchmarking, Ethics | Unverified | 0 |
| The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong | May 12, 2025 | Benchmarking | Unverified | 0 |
| PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints | May 12, 2025 | Benchmarking | Unverified | 0 |
| FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | May 12, 2025 | 16k, Benchmarking | Unverified | 0 |
| From raw affiliations to organization identifiers | May 12, 2025 | Benchmarking, Metadata quality | Code Available | 0 |
| Benchmarking Retrieval-Augmented Generation for Chemistry | May 12, 2025 | Benchmarking, RAG | Unverified | 0 |
| Benchmarking of CPU-intensive Stream Data Processing in The Edge Computing Systems | May 12, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Benchmarking Graph Neural Networks for Document Layout Analysis in Public Affairs | May 12, 2025 | Benchmarking, Document Layout Analysis | Unverified | 0 |
| Optimizing Recommendations using Fine-Tuned LLMs | May 11, 2025 | Benchmarking, Recommendation Systems | Unverified | 0 |
| Multi-Modal Explainable Medical AI Assistant for Trustworthy Human-AI Collaboration | May 11, 2025 | Benchmarking, Descriptive | Unverified | 0 |
| From Knowledge to Reasoning: Evaluating LLMs for Ionic Liquids Research in Chemical and Biological Engineering | May 11, 2025 | Benchmarking, General Knowledge | Code Available | 0 |
| Contributions of the Petabyte Scale Sequence Search Codeathon toward efforts to scale sequence-based searches on SRA | May 9, 2025 | Benchmarking, scientific discovery | Unverified | 0 |
| Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information | May 9, 2025 | Benchmarking, Form | Unverified | 0 |
| Evaluating Financial Sentiment Analysis with Annotators Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy | May 9, 2025 | Benchmarking, Sentiment Analysis | Unverified | 0 |
| DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions | May 8, 2025 | Autonomous Navigation, Benchmarking | Code Available | 0 |
| clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations | May 8, 2025 | Benchmarking, Task-Oriented Dialogue Systems | Unverified | 0 |
| A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge | May 8, 2025 | Benchmarking | Code Available | 0 |
| Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization | May 8, 2025 | Attribute, Benchmarking | Unverified | 0 |
| Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective | May 8, 2025 | Active Learning, Benchmarking | Code Available | 0 |
| Autoregressive Stochastic Clock Jitter Compensation in Analog-to-Digital Converters | May 8, 2025 | Benchmarking | Unverified | 0 |
| Software Development Life Cycle Perspective: A Survey of Benchmarks for Code Large Language Models and Agents | May 8, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection | May 8, 2025 | Benchmarking, Out-of-Distribution Generalization | Unverified | 0 |
| QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation | May 8, 2025 | Benchmarking, Federated Learning | Unverified | 0 |
| Advancing and Benchmarking Personalized Tool Invocation for LLMs | May 7, 2025 | Benchmarking, World Knowledge | Code Available | 0 |
| False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims | May 7, 2025 | Benchmarking | Code Available | 0 |
| Alpha Excel Benchmark | May 7, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Traditional Machine Learning and Deep Learning Models for Fault Detection in Power Transformers | May 7, 2025 | Benchmarking, Fault Detection | Code Available | 0 |
| Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | May 7, 2025 | Benchmarking, Semantic Segmentation | Code Available | 0 |
| Call for Action: towards the next generation of symbolic regression benchmark | May 6, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models | May 6, 2025 | Benchmarking, Image Generation | Code Available | 0 |
| Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | May 6, 2025 | Benchmarking, Earth Observation | Code Available | 0 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 |
| Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning | May 5, 2025 | Benchmarking | Unverified | 0 |
| NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities | May 5, 2025 | Benchmarking, Quantization | Code Available | 0 |
| Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking | May 5, 2025 | Benchmarking, Prediction | Unverified | 0 |
| Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | May 4, 2025 | Benchmarking, Feature Upsampling | Code Available | 0 |
| Meta-Black-Box-Optimization through Offline Q-function Learning | May 4, 2025 | Benchmarking, Mamba | Code Available | 0 |
| Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking | May 4, 2025 | Benchmarking, Representation Learning | Code Available | 0 |
| NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks | May 4, 2025 | Benchmarking, Representation Learning | Code Available | 0 |
| Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing | May 3, 2025 | Benchmarking, Image Segmentation | Unverified | 0 |
| CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture | May 3, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| BOOM: Benchmarking Out-Of-distribution Molecular Property Predictions of Machine Learning Models | May 3, 2025 | Benchmarking, Hyperparameter Optimization | Unverified | 0 |
| PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach | May 3, 2025 | Benchmarking, Image-to-Image Translation | Unverified | 0 |
| Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey | May 3, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking | May 3, 2025 | Benchmarking, Data Integration | Unverified | 0 |
| Parameterized Argumentation-based Reasoning Tasks for Benchmarking Generative Language Models | May 2, 2025 | Benchmarking | Code Available | 0 |
| Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging | May 2, 2025 | Benchmarking, Computational Efficiency | Unverified | 0 |