| LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data | Feb 13, 2025 | Benchmarking, State Space Models | Code Available | 1 |
| Machine learning for modelling unstructured grid data in computational physics: a review | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| SkyRover: A Modular Simulator for Cross-Domain Pathfinding | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Handwritten Text Recognition: A Survey | Feb 12, 2025 | Benchmarking, Handwritten Text Recognition | Unverified | 0 |
| One-Shot Federated Learning with Classifier-Free Diffusion Models | Feb 12, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance | Feb 12, 2025 | Benchmarking, Long-Context Understanding | Code Available | 2 |
| Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | Feb 12, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem | Feb 11, 2025 | Benchmarking, Diversity | Code Available | 0 |
| The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation | Feb 11, 2025 | Benchmarking, De-identification | Code Available | 0 |
| Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Feb 10, 2025 | Benchmarking | Code Available | 1 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | Feb 10, 2025 | Benchmarking, In-Context Learning | Unverified | 0 |
| Accelerating Data Processing and Benchmarking of AI Models for Pathology | Feb 10, 2025 | Benchmarking | Code Available | 4 |
| Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | Feb 10, 2025 | Benchmarking | Unverified | 0 |
| CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories | Feb 10, 2025 | Benchmarking | Unverified | 0 |
| Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring | Feb 10, 2025 | Benchmarking | Code Available | 0 |
| Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments | Feb 10, 2025 | Benchmarking, Optical Character Recognition | Code Available | 1 |
| Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm) | Feb 9, 2025 | Benchmarking, CPU | Unverified | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | Feb 9, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Mol-MoE: Training Preference-Guided Routers for Molecule Generation | Feb 8, 2025 | Benchmarking, Drug Design | Code Available | 0 |
| Surprise Potential as a Measure of Interactivity in Driving Scenarios | Feb 8, 2025 | Benchmarking | Unverified | 0 |
| ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts | Feb 8, 2025 | Benchmarking, Self-Supervised Learning | Code Available | 1 |
| ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks | Feb 7, 2025 | Benchmarking | Code Available | 3 |
| An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks | Feb 7, 2025 | Benchmarking, Multi-agent Reinforcement Learning | Code Available | 1 |
| Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | Feb 7, 2025 | Benchmarking | Code Available | 4 |
| Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Feb 6, 2025 | Benchmarking, Epidemiology | Code Available | 0 |