| Beyond Benchmarks: On The False Promise of AI Regulation | Jan 26, 2025 | Benchmarking | Unverified | 0 |
| EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning | Jan 25, 2025 | Benchmarking, Evolutionary Algorithms | Code Available | 7 |
| Prompting ChatGPT for Chinese Learning as L2: A CEFR and EBCL Level Study | Jan 25, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking global optimization techniques for unmanned aerial vehicle path planning | Jan 24, 2025 | Benchmarking, global-optimization | Unverified | 0 |
| MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents | Jan 24, 2025 | Benchmarking | Code Available | 3 |
| Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems | Jan 24, 2025 | Benchmarking, Diversity | Unverified | 0 |
| The Karp Dataset | Jan 24, 2025 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video | Jan 24, 2025 | 3D Reconstruction, Benchmarking | Code Available | 2 |
| Enhancing Biomedical Relation Extraction with Directionality | Jan 23, 2025 | Benchmarking, Document-level Relation Extraction | Code Available | 1 |
| AEON: Adaptive Estimation of Instance-Dependent In-Distribution and Out-of-Distribution Label Noise for Robust Learning | Jan 23, 2025 | Benchmarking, image-classification | Unverified | 0 |
| DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale | Jan 23, 2025 | Benchmarking | Unverified | 0 |
| You Only Crash Once v2: Perceptually Consistent Strong Features for One-Stage Domain Adaptive Detection of Space Terrain | Jan 23, 2025 | Benchmarking, Domain Adaptation | Unverified | 0 |
| RAG-Reward: Optimizing RAG with Reward Modeling and RLHF | Jan 22, 2025 | Benchmarking, Hallucination | Unverified | 0 |
| Leveraging LLMs to Create a Haptic Devices' Recommendation System | Jan 22, 2025 | Benchmarking | Unverified | 0 |
| Implicit Causality-biases in humans and LLMs as a tool for benchmarking LLM discourse capabilities | Jan 22, 2025 | Benchmarking, Referring Expression | Unverified | 0 |
| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Jan 22, 2025 | Benchmarking | Code Available | 0 |
| CHaRNet: Conditioned Heatmap Regression for Robust Dental Landmark Localization | Jan 22, 2025 | Benchmarking, regression | Unverified | 0 |
| Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs) | Jan 21, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes | Jan 21, 2025 | Benchmarking | Unverified | 0 |
| Optimally-Weighted Maximum Mean Discrepancy Framework for Continual Learning | Jan 21, 2025 | Benchmarking, Continual Learning | Unverified | 0 |
| Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems | Jan 21, 2025 | Autonomous Vehicles, Benchmarking | Code Available | 0 |
| Beyond the Hype: Benchmarking LLM-Evolved Heuristics for Bin Packing | Jan 20, 2025 | Benchmarking, Evolutionary Algorithms | Unverified | 0 |
| Algorithm Selection with Probing Trajectories: Benchmarking the Choice of Classifier Model | Jan 20, 2025 | Benchmarking | Unverified | 0 |
| Benchmarking Large Language Models via Random Variables | Jan 20, 2025 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models | Jan 19, 2025 | Benchmarking, Question Answering | Code Available | 1 |