| HazeSpace2M: A Dataset for Haze Aware Single Image Dehazing | Sep 25, 2024 | Benchmarking, Image Dehazing | Code Available | 1 |
| Benchmarking Domain Generalization Algorithms in Computational Pathology | Sep 25, 2024 | Benchmarking, Data Augmentation | Code Available | 0 |
| Benchmarking Deep Learning Models for Object Detection on Edge Computing Devices | Sep 25, 2024 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| SEN12-WATER: A New Dataset for Hydrological Applications and its Benchmarking | Sep 25, 2024 | Benchmarking, Management | Unverified | 0 |
| GSplatLoc: Grounding Keypoint Descriptors into 3D Gaussian Splatting for Improved Visual Localization | Sep 24, 2024 | 3D geometry, 3DGS | Code Available | 2 |
| Ducho meets Elliot: Large-scale Benchmarks for Multimodal Recommendation | Sep 24, 2024 | Benchmarking, Movie Recommendation | Code Available | 0 |
| Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data | Sep 24, 2024 | Benchmarking, Depth Estimation | Code Available | 0 |
| Controlling Risk of Retrieval-augmented Generation: A Counterfactual Prompting Framework | Sep 24, 2024 | Benchmarking, counterfactual | Code Available | 0 |
| Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling | Sep 24, 2024 | Articles, Benchmarking | Unverified | 0 |
| HLB: Benchmarking LLMs' Humanlikeness in Language Use | Sep 24, 2024 | Benchmarking | Unverified | 0 |
| Small Language Models: Survey, Measurements, and Insights | Sep 24, 2024 | Benchmarking, Decoder | Code Available | 2 |
| Building a continuous benchmarking ecosystem in bioinformatics | Sep 23, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Edge AI Platforms for High-Performance ML Inference | Sep 23, 2024 | Benchmarking, CPU | Unverified | 0 |
| Boosting Healthcare LLMs Through Retrieved Context | Sep 23, 2024 | Benchmarking, Multiple-choice | Code Available | 1 |
| Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images | Sep 23, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking | Sep 23, 2024 | Benchmarking, Diversity | Code Available | 0 |
| RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code | Sep 23, 2024 | Benchmarking, Code Generation | Code Available | 1 |
| AlphaZip: Neural Network-Enhanced Lossless Text Compression | Sep 23, 2024 | Benchmarking, Data Compression | Code Available | 0 |
| Margin-bounded Confidence Scores for Out-of-Distribution Detection | Sep 22, 2024 | Autonomous Driving, Benchmarking | Code Available | 0 |
| Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra | Sep 22, 2024 | Benchmarking | Unverified | 0 |
| Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Sep 22, 2024 | AutoML, Benchmarking | Code Available | 0 |
| The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests | Sep 22, 2024 | Benchmarking | Unverified | 0 |
| A Survey on Multimodal Benchmarks: In the Era of Large AI Models | Sep 21, 2024 | Benchmarking, Survey | Code Available | 2 |
| CONGRA: Benchmarking Automatic Conflict Resolution | Sep 21, 2024 | Benchmarking | Code Available | 0 |
| @Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology | Sep 21, 2024 | Benchmarking, Depth Estimation | Unverified | 0 |