| Arctique: An artificial histopathological dataset unifying realism and controllability for uncertainty quantification | Nov 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset | Nov 5, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks | Nov 4, 2024 | Action Generation, Benchmarking | Code Available | 1 |
| LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation | Nov 4, 2024 | Benchmarking, Graph Generation | Code Available | 1 |
| ROAD-Waymo: Action Awareness at Scale for Autonomous Driving | Nov 3, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| MIRFLEX: Music Information Retrieval Feature Library for Extraction | Nov 1, 2024 | Benchmarking, Information Retrieval | Code Available | 1 |
| LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models | Nov 1, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 |
| AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery | Oct 31, 2024 | Benchmarking, Cloud Removal | Code Available | 1 |
| Pedestrian Trajectory Prediction with Missing Data: Datasets, Imputation, and Benchmarking | Oct 31, 2024 | Benchmarking, Imputation | Code Available | 1 |
| DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios | Oct 31, 2024 | Benchmarking, LLM-generated Text Detection | Code Available | 1 |
| LLM4Mat-Bench: Benchmarking Large Language Models for Materials Property Prediction | Oct 31, 2024 | Benchmarking, Prediction | Code Available | 1 |
| EMGBench: Benchmarking Out-of-Distribution Generalization and Adaptation for Electromyography | Oct 31, 2024 | Benchmarking, Electromyography (EMG) | Code Available | 1 |
| DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems | Oct 30, 2024 | Benchmarking, Management | Code Available | 1 |
| Survey of Cultural Awareness in Language Models: Text and Beyond | Oct 30, 2024 | Benchmarking | Code Available | 1 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Oct 28, 2024 | Benchmarking, Language Modeling | Code Available | 1 |
| SPICEPilot: Navigating SPICE Code Generation and Simulation with AI Guidance | Oct 27, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios | Oct 25, 2024 | Benchmarking, Diversity | Code Available | 1 |
| Benchmarking Multi-Scene Fire and Smoke Detection | Oct 22, 2024 | Benchmarking | Code Available | 1 |
| Comprehensive benchmarking of large language models for RNA secondary structure prediction | Oct 21, 2024 | Benchmarking | Code Available | 1 |
| MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems | Oct 18, 2024 | Benchmarking, Question Answering | Code Available | 1 |
| Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments | Oct 18, 2024 | Autonomous Navigation, Benchmarking | Code Available | 1 |
| Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all | Oct 17, 2024 | All, Benchmarking | Code Available | 1 |
| WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation | Oct 16, 2024 | Benchmarking, Fairness | Code Available | 1 |
| RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation | Oct 15, 2024 | Benchmarking, Interactive Segmentation | Code Available | 1 |
| TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | Oct 14, 2024 | 2k, Benchmarking | Code Available | 1 |