| SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution | Jun 13, 2024 | Benchmarking, Image Super-Resolution | Code Available | 1 |
| Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark | Jun 12, 2024 | Benchmarking, Mixture-of-Experts | Code Available | 1 |
| Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework | Jun 12, 2024 | Benchmarking, Causal Inference | Code Available | 1 |
| TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | Jun 12, 2024 | Benchmarking, Image Generation | Code Available | 1 |
| RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection | Jun 11, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| AudioMarkBench: Benchmarking Robustness of Audio Watermarking | Jun 11, 2024 | Benchmarking, text-to-speech | Code Available | 1 |
| EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | Jun 9, 2024 | Benchmarking | Code Available | 1 |
| Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking | Jun 9, 2024 | Benchmarking, Drug Discovery | Code Available | 1 |
| ICU-Sepsis: A Benchmark MDP Built from Real Medical Data | Jun 9, 2024 | Benchmarking, Management | Code Available | 1 |
| QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation | Jun 9, 2024 | Benchmarking, Question Generation | Code Available | 1 |
| CLoG: Benchmarking Continual Learning of Image Generation Models | Jun 7, 2024 | Benchmarking, Continual Learning | Code Available | 1 |
| CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark | Jun 5, 2024 | Benchmarking | Code Available | 1 |
| TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising | Jun 5, 2024 | Benchmarking, Denoising | Code Available | 1 |
| CommonPower: A Framework for Safe Data-Driven Smart Grid Control | Jun 5, 2024 | Benchmarking, energy management | Code Available | 1 |
| An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders | Jun 4, 2024 | Benchmarking, Clustering | Code Available | 1 |
| animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics | Jun 3, 2024 | Audio Classification, Benchmarking | Code Available | 1 |
| GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models | Jun 1, 2024 | Benchmarking | Code Available | 1 |
| LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | May 30, 2024 | Benchmarking | Code Available | 1 |
| SECURE: Benchmarking Large Language Models for Cybersecurity | May 30, 2024 | Benchmarking | Code Available | 1 |
| Aquatic Navigation: A Challenging Benchmark for Deep Reinforcement Learning | May 30, 2024 | Autonomous Driving, Benchmarking | Code Available | 1 |
| Quantitative Certification of Bias in Large Language Models | May 29, 2024 | Benchmarking | Code Available | 1 |
| MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions | May 29, 2024 | Benchmarking, Dialogue Understanding | Code Available | 1 |
| DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime | May 28, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 1 |
| Benchmarking Skeleton-based Motion Encoder Models for Clinical Applications: Estimating Parkinson's Disease Severity in Walking Sequences | May 28, 2024 | Benchmarking, Feature Engineering | Code Available | 1 |
| Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling | May 23, 2024 | Benchmarking | Code Available | 1 |
| GCondenser: Benchmarking Graph Condensation | May 23, 2024 | Benchmarking, Graph Representation Learning | Code Available | 1 |
| Benchmarking Fish Dataset and Evaluation Metric in Keypoint Detection -- Towards Precise Fish Morphological Assessment in Aquaculture Breeding | May 21, 2024 | Benchmarking, Keypoint Detection | Code Available | 1 |
| DocuMint: Docstring Generation for Python using Small Language Models | May 16, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| SciFIBench: Benchmarking Large Multimodal Models for Scientific Figure Interpretation | May 14, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Benchmarking Classical and Learning-Based Multibeam Point Cloud Registration | May 10, 2024 | Benchmarking, Point Cloud Registration | Code Available | 1 |
| AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets | May 7, 2024 | Benchmarking, Cancer Classification | Code Available | 1 |
| Position: Quo Vadis, Unsupervised Time Series Anomaly Detection? | May 4, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| ATOMMIC: An Advanced Toolbox for Multitask Medical Imaging Consistency to facilitate Artificial Intelligence applications from acquisition to analysis in Magnetic Resonance Imaging | Apr 30, 2024 | Benchmarking, Image Reconstruction | Code Available | 1 |
| Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations? | Apr 29, 2024 | Answer Generation, Benchmarking | Code Available | 1 |
| 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs | Apr 28, 2024 | Benchmarking | Code Available | 1 |
| Multi-Stream Cellular Test-Time Adaptation of Real-Time Models Evolving in Dynamic Environments | Apr 27, 2024 | Autonomous Vehicles, Benchmarking | Code Available | 1 |
| Constellation Dataset: Benchmarking High-Altitude Object Detection for an Urban Intersection | Apr 25, 2024 | Benchmarking, object-detection | Code Available | 1 |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | Apr 24, 2024 | Attribute, Attribute Value Extraction | Code Available | 1 |
| SynthEval: A Framework for Detailed Utility and Privacy Evaluation of Tabular Synthetic Data | Apr 24, 2024 | Benchmarking, Fairness | Code Available | 1 |
| TAVGBench: Benchmarking Text to Audible-Video Generation | Apr 22, 2024 | Benchmarking, Contrastive Learning | Code Available | 1 |
| Experimental Validation of Ultrasound Beamforming with End-to-End Deep Learning for Single Plane Wave Imaging | Apr 22, 2024 | Benchmarking | Code Available | 1 |
| A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models | Apr 22, 2024 | Benchmarking, World Knowledge | Code Available | 1 |
| REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking | Apr 19, 2024 | Benchmarking, coreference-resolution | Code Available | 1 |
| How to Benchmark Vision Foundation Models for Semantic Segmentation? | Apr 18, 2024 | Benchmarking, Decoder | Code Available | 1 |
| Second Edition FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic Data | Apr 16, 2024 | Benchmarking, Face Recognition | Code Available | 1 |
| Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations | Apr 15, 2024 | Benchmarking, Bias Detection | Code Available | 1 |
| A Review and Efficient Implementation of Scene Graph Generation Metrics | Apr 15, 2024 | Benchmarking, Graph Generation | Code Available | 1 |
| MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems | Apr 15, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation | Apr 15, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion | Apr 14, 2024 | Benchmarking, Data Augmentation | Code Available | 1 |