| FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data | Mar 7, 2025 | Benchmarking, Federated Learning | Code Available | 1 |
| Removing Geometric Bias in One-Class Anomaly Detection with Adaptive Feature Perturbation | Mar 7, 2025 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Understanding the Limits of Lifelong Knowledge Editing in LLMs | Mar 7, 2025 | Benchmarking, knowledge editing | Unverified | 0 |
| Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol | Mar 7, 2025 | Benchmarking, Bug fixing | Unverified | 0 |
| FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance | Mar 7, 2025 | Articles, Benchmarking | Unverified | 0 |
| Assumed Identities: Quantifying Gender Bias in Machine Translation of Gender-Ambiguous Occupational Terms | Mar 6, 2025 | Benchmarking, Machine Translation | Unverified | 0 |
| Benchmarking Reasoning Robustness in Large Language Models | Mar 6, 2025 | Benchmarking, Math | Unverified | 0 |
| Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets | Mar 6, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Mar 6, 2025 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models | Mar 6, 2025 | Benchmarking, Continual Learning | Code Available | 0 |
| Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases | Mar 6, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions | Mar 6, 2025 | Benchmarking, HumanEval | Code Available | 0 |
| Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination | Mar 6, 2025 | Benchmarking | Unverified | 0 |
| InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference | Mar 6, 2025 | Benchmarking | Unverified | 0 |
| Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | Mar 6, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| Eventprop training for efficient neuromorphic applications | Mar 6, 2025 | Benchmarking, GPU | Unverified | 0 |
| Towards Universal Learning-based Model for Cardiac Image Reconstruction: Summary of the CMRxRecon2024 Challenge | Mar 5, 2025 | Benchmarking, Image Reconstruction | Unverified | 0 |
| UnPuzzle: A Unified Framework for Pathology Image Analysis | Mar 5, 2025 | Benchmarking, Diagnostic | Code Available | 1 |
| GNNMerge: Merging of GNN Models Without Accessing Training Data | Mar 5, 2025 | Benchmarking, Computational Efficiency | Code Available | 0 |
| AttackSeqBench: Benchmarking Large Language Models' Understanding of Sequential Patterns in Cyber Attacks | Mar 5, 2025 | Benchmarking, graph construction | Code Available | 0 |
| Benchmarking Dynamic SLO Compliance in Distributed Computing Continuum Systems | Mar 5, 2025 | Benchmarking, CPU | Code Available | 0 |
| Technical report of a DMD-based Characterization Method for Vision Sensors | Mar 4, 2025 | Benchmarking, Dataset Generation | Unverified | 0 |
| Optimizing open-domain question answering with graph-based retrieval augmented generation | Mar 4, 2025 | Benchmarking, Language Modeling | Unverified | 0 |
| A2Perf: Real-World Autonomous Agents Benchmark | Mar 4, 2025 | Benchmarking, Combinatorial Optimization | Unverified | 0 |
| Evaluation of Architectural Synthesis Using Generative AI | Mar 4, 2025 | Benchmarking | Unverified | 0 |