| FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | Mar 7, 2025 | Articles, Benchmarking | Unverified | 0 |
| Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders | Mar 7, 2025 | Benchmarking, Click-Through Rate Prediction | Unverified | 0 |
| Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation | Mar 7, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | Mar 6, 2025 | Benchmarking | Unverified | 0 |
| InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference | Mar 6, 2025 | Benchmarking | Unverified | 0 |
| LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Mar 6, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Mar 6, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Mar 6, 2025 | Benchmarking, Continual Learning | Code Available | 0 |
| ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions | Mar 6, 2025 | Benchmarking, HumanEval | Code Available | 0 |
| Benchmarking Reasoning Robustness in Large Language Models | Mar 6, 2025 | Benchmarking, Math | Unverified | 0 |