| Understanding the Limits of Lifelong Knowledge Editing in LLMs | Mar 7, 2025 | Benchmarking, knowledge editing | Unverified | 0 |
| FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Mar 7, 2025 | Benchmarking, Federated Learning | Code Available | 1 |
| Benchmarking LLMs in Recommendation Tasks: A Comparative Evaluation with Conventional Recommenders | Mar 7, 2025 | Benchmarking, Click-Through Rate Prediction | Unverified | 0 |
| Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | Mar 7, 2025 | Benchmarking, Bug fixing | Unverified | 0 |
| FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | Mar 7, 2025 | Articles, Benchmarking | Unverified | 0 |
| Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | Mar 6, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| Benchmarking Reasoning Robustness in Large Language Models | Mar 6, 2025 | Benchmarking, Math | Unverified | 0 |
| Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets | Mar 6, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Mar 6, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Mar 6, 2025 | Benchmarking, Continual Learning | Code Available | 0 |