| False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims | May 7, 2025 | Benchmarking | Code Available | 0 |
| Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? | May 7, 2025 | Benchmarking, Semantic Segmentation | Code Available | 0 |
| Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards | May 7, 2025 | Benchmarking, Hallucination | Code Available | 1 |
| RGB-Event Fusion with Self-Attention for Collision Prediction | May 7, 2025 | Benchmarking, Computational Efficiency | Code Available | 1 |
| Advancing and Benchmarking Personalized Tool Invocation for LLMs | May 7, 2025 | Benchmarking, World Knowledge | Code Available | 0 |
| Benchmarking LLMs' Swarm intelligence | May 7, 2025 | Benchmarking | Code Available | 1 |
| Alpha Excel Benchmark | May 7, 2025 | Benchmarking | Unverified | 0 |
| Call for Action: towards the next generation of symbolic regression benchmark | May 6, 2025 | Benchmarking, Diversity | Unverified | 0 |
| Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models | May 6, 2025 | Benchmarking, Image Generation | Code Available | 0 |
| MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks | May 6, 2025 | Benchmarking, Multiple-choice | Code Available | 0 |
| Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach | May 6, 2025 | Benchmarking, Earth Observation | Code Available | 0 |
| CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics | May 6, 2025 | Benchmarking | Code Available | 1 |
| Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking | May 5, 2025 | Benchmarking, Prediction | Unverified | 0 |
| FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models | May 5, 2025 | Benchmarking, Mathematical Reasoning | Code Available | 2 |
| NeuroSim V1.5: Improved Software Backbone for Benchmarking Compute-in-Memory Accelerators with Device and Circuit-level Non-idealities | May 5, 2025 | Benchmarking, Quantization | Code Available | 0 |
| Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning | May 5, 2025 | Benchmarking | Unverified | 0 |
| NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks | May 4, 2025 | Benchmarking, Representation Learning | Code Available | 0 |
| Meta-Black-Box-Optimization through Offline Q-function Learning | May 4, 2025 | Benchmarking, Mamba | Code Available | 0 |
| Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | May 4, 2025 | Benchmarking, Feature Upsampling | Code Available | 0 |
| RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video | May 4, 2025 | Benchmarking, Question Answering | Code Available | 1 |
| Representation Learning of Limit Order Book: A Comprehensive Study and Benchmarking | May 4, 2025 | Benchmarking, Representation Learning | Code Available | 0 |
| Not Every Tree Is a Forest: Benchmarking Forest Types from Satellite Remote Sensing | May 3, 2025 | Benchmarking, Image Segmentation | Unverified | 0 |
| CMAWRNet: Multiple Adverse Weather Removal via a Unified Quaternion Neural Architecture | May 3, 2025 | Autonomous Driving, Benchmarking | Unverified | 0 |
| Interpretable graph-based models on multimodal biomedical data integration: A technical review and benchmarking | May 3, 2025 | Benchmarking, Data Integration | Unverified | 0 |
| PhytoSynth: Leveraging Multi-modal Generative Models for Crop Disease Data Generation with Novel Benchmarking and Prompt Engineering Approach | May 3, 2025 | Benchmarking, Image-to-Image Translation | Unverified | 0 |