| Forecasting time series with constraints | Feb 14, 2025 | Additive models, Benchmarking | Code Available | 0 |
| SkyRover: A Modular Simulator for Cross-Domain Pathfinding | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | Feb 13, 2025 | Benchmarking, Edge-computing | Unverified | 0 |
| Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Zero-shot generation of synthetic neurosurgical data with large language models | Feb 13, 2025 | Benchmarking, De-identification | Code Available | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | Unverified | 0 |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| A Survey on LLM-based News Recommender Systems | Feb 13, 2025 | Benchmarking, Fairness | Unverified | 0 |
| Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Machine learning for modelling unstructured grid data in computational physics: a review | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Handwritten Text Recognition: A Survey | Feb 12, 2025 | Benchmarking, Handwritten Text Recognition | Unverified | 0 |
| Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | Feb 12, 2025 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| One-Shot Federated Learning with Classifier-Free Diffusion Models | Feb 12, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| exHarmony: Authorship and Citations for Benchmarking the Reviewer Assignment Problem | Feb 11, 2025 | Benchmarking, Diversity | Code Available | 0 |
| The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation | Feb 11, 2025 | Benchmarking, De-identification | Code Available | 0 |
| CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories | Feb 10, 2025 | Benchmarking | Unverified | 0 |
| Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring | Feb 10, 2025 | Benchmarking | Code Available | 0 |
| Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation | Feb 10, 2025 | Benchmarking | Unverified | 0 |
| MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | Feb 10, 2025 | Benchmarking, In-Context Learning | Unverified | 0 |
| Decoding Complexity: Intelligent Pattern Exploration with CHPDA (Context Aware Hybrid Pattern Detection Algorithm) | Feb 9, 2025 | Benchmarking, CPU | Unverified | 0 |
| Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models | Feb 9, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| Surprise Potential as a Measure of Interactivity in Driving Scenarios | Feb 8, 2025 | Benchmarking | Unverified | 0 |
| Mol-MoE: Training Preference-Guided Routers for Molecule Generation | Feb 8, 2025 | Benchmarking, Drug Design | Code Available | 0 |
| LUND-PROBE -- LUND Prostate Radiotherapy Open Benchmarking and Evaluation dataset | Feb 6, 2025 | Benchmarking, Computed Tomography (CT) | Unverified | 0 |
| Improving the Perturbation-Based Explanation of Deepfake Detectors Through the Use of Adversarially-Generated Samples | Feb 6, 2025 | Benchmarking, DeepFake Detection | Code Available | 0 |
| Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | Feb 6, 2025 | Benchmarking, Uncertainty Quantification | Unverified | 0 |
| Verifiable Format Control for Large Language Model Generations | Feb 6, 2025 | Benchmarking, Instruction Following | Unverified | 0 |
| PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data | Feb 6, 2025 | Benchmarking, Time Series | Code Available | 0 |
| Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs | Feb 6, 2025 | Benchmarking, Epidemiology | Code Available | 0 |
| EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models | Feb 6, 2025 | Benchmarking, Emotional Intelligence | Unverified | 0 |
| Energy & Force Regression on DFT Trajectories is Not Enough for Universal Machine Learning Interatomic Potentials | Feb 5, 2025 | Benchmarking | Unverified | 0 |
| Optimal PMU Placement for Kalman Filtering of DAE Power System Models | Feb 5, 2025 | Benchmarking, State Estimation | Unverified | 0 |
| xai_evals : A Framework for Evaluating Post-Hoc Local Explanation Methods | Feb 5, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Time Series Forecasting Models: From Statistical Techniques to Foundation Models in Real-World Applications | Feb 5, 2025 | Benchmarking, Feature Engineering | Unverified | 0 |
| TGB-Seq Benchmark: Challenging Temporal GNNs with Complex Sequential Dynamics | Feb 5, 2025 | Benchmarking, Link Prediction | Code Available | 0 |
| MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf | Feb 5, 2025 | Benchmarking, Scheduling | Unverified | 0 |
| LadderMIL: Multiple Instance Learning with Coarse-to-Fine Self-Distillation | Feb 4, 2025 | Benchmarking, Classification | Unverified | 0 |
| No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets | Feb 4, 2025 | All, Benchmarking | Code Available | 0 |
| Evalita-LLM: Benchmarking Large Language Models on Italian | Feb 4, 2025 | Benchmarking, Multiple-choice | Unverified | 0 |
| A comparison of translation performance between DeepL and Supertext | Feb 4, 2025 | Benchmarking, Machine Translation | Code Available | 0 |
| Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models | Feb 4, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Dynamic benchmarking framework for LLM-based conversational data capture | Feb 4, 2025 | Benchmarking | Unverified | 0 |
| MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation | Feb 3, 2025 | Benchmarking, Fairness | Unverified | 0 |
| SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering | Feb 3, 2025 | Benchmarking, Code Generation | Unverified | 0 |
| EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools | Feb 3, 2025 | Benchmarking | Unverified | 0 |
| Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities | Feb 3, 2025 | Benchmarking, Large Language Model | Unverified | 0 |
| Learned Bayesian Cramér-Rao Bound for Unknown Measurement Models Using Score Neural Networks | Feb 2, 2025 | Benchmarking | Code Available | 0 |
| True Online TD-Replan(lambda) Achieving Planning through Replaying | Jan 31, 2025 | Benchmarking | Unverified | 0 |
| MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | Jan 30, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Fine-tuning LLaMA 2 interference: a comparative study of language implementations for optimal efficiency | Jan 30, 2025 | Benchmarking, Language Modeling | Unverified | 0 |