| Benchmarking Object Detectors under Real-World Distribution Shifts in Satellite Imagery | Mar 24, 2025 | Benchmarking, Humanitarian | Code Available | 1 |
| Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness | Mar 24, 2025 | Benchmarking, Semantic Segmentation | Code Available | 1 |
| GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks | Mar 23, 2025 | Benchmarking, Hallucination | Code Available | 1 |
| V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction | Mar 22, 2025 | Benchmarking, Video Understanding | Code Available | 1 |
| QCPINN: Quantum-Classical Physics-Informed Neural Networks for Solving PDEs | Mar 20, 2025 | Benchmarking, Physics-informed machine learning | Code Available | 1 |
| The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination | Mar 20, 2025 | Benchmarking, Large Language Model | Code Available | 1 |
| JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System | Mar 18, 2025 | Benchmarking, In-Context Learning | Code Available | 1 |
| Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos | Mar 17, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research | Mar 17, 2025 | Articles, Benchmarking | Code Available | 1 |
| GNNs as Predictors of Agentic Workflow Performances | Mar 14, 2025 | Benchmarking, Position | Code Available | 1 |