| CholecTrack20: A Multi-Perspective Tracking Dataset for Surgical Tools | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions | Jan 1, 2025 | Benchmarking | Code Available | 0 |
| nnWNet: Rethinking the Use of Transformers in Biomedical Image Segmentation and Calling for a Unified Evaluation Benchmark | Jan 1, 2025 | Benchmarking, Image Segmentation | Code Available | 2 |
| On the Utility of Equivariance and Symmetry Breaking in Deep Learning Architectures on Point Clouds | Jan 1, 2025 | Benchmarking | Unverified | 0 |
| Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries | Dec 31, 2024 | Benchmarking, Out-of-Distribution Generalization | Unverified | 0 |
| OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | Dec 31, 2024 | Benchmarking, Logical Reasoning | Code Available | 4 |
| A review of faithfulness metrics for hallucination assessment in Large Language Models | Dec 31, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| AraSTEM: A Native Arabic Multiple Choice Question Benchmark for Evaluating LLMs Knowledge In STEM Subjects | Dec 31, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| Measuring Large Language Models Capacity to Annotate Journalistic Sourcing | Dec 30, 2024 | Benchmarking, Ethics | Unverified | 0 |
| TrajLearn: Trajectory Prediction Learning using Deep Generative Models | Dec 30, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |