| Benchmarking the rationality of AI decision making using the transitivity axiom | Feb 14, 2025 | Benchmarking, Decision Making | Unverified | 0 |
| Forecasting time series with constraints | Feb 14, 2025 | Additive models, Benchmarking | Code Available | 0 |
| A Survey on LLM-based News Recommender Systems | Feb 13, 2025 | Benchmarking, Fairness | Unverified | 0 |
| AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit | Feb 13, 2025 | Benchmarking, Edge-computing | Unverified | 0 |
| MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | Feb 13, 2025 | Benchmarking, Math | Unverified | 0 |
| Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs | Feb 13, 2025 | Benchmarking, Retrieval | Code Available | 1 |
| Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | Feb 13, 2025 | Benchmarking | Unverified | 0 |
| Zero-shot generation of synthetic neurosurgical data with large language models | Feb 13, 2025 | Benchmarking, De-identification | Code Available | 0 |