| DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning | Mar 9, 2025 | Benchmarking, Decision Making | Code Available | 0 |
| Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation | Mar 9, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| DependEval: Benchmarking LLMs for Repository Dependency Understanding | Mar 9, 2025 | Benchmarking, Code Generation | Code Available | 1 |
| Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models | Mar 9, 2025 | Benchmarking | Unverified | 0 |
| Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems | Mar 9, 2025 | Benchmarking | Unverified | 0 |
| General Scales Unlock AI Evaluation with Explanatory and Predictive Power | Mar 9, 2025 | Benchmarking, Specificity | Unverified | 0 |
| Removing Multiple Hybrid Adverse Weather in Video via a Unified Model | Mar 8, 2025 | Benchmarking, Video Restoration | Unverified | 0 |
| UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces | Mar 8, 2025 | Benchmarking, counterfactual | Unverified | 0 |
| SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios | Mar 8, 2025 | Benchmarking, Diagnostic | Code Available | 0 |
| Understanding the Limits of Lifelong Knowledge Editing in LLMs | Mar 7, 2025 | Benchmarking, knowledge editing | Unverified | 0 |