| Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images | Sep 23, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking | Sep 23, 2024 | Benchmarking, Diversity | Code Available | 0 |
| Benchmarking Edge AI Platforms for High-Performance ML Inference | Sep 23, 2024 | Benchmarking, CPU | —Unverified | 0 |
| Building a continuous benchmarking ecosystem in bioinformatics | Sep 23, 2024 | Benchmarking | —Unverified | 0 |
| AlphaZip: Neural Network-Enhanced Lossless Text Compression | Sep 23, 2024 | Benchmarking, Data Compression | Code Available | 0 |
| The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests | Sep 22, 2024 | Benchmarking | —Unverified | 0 |
| Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance | Sep 22, 2024 | AutoML, Benchmarking | Code Available | 0 |
| Margin-bounded Confidence Scores for Out-of-Distribution Detection | Sep 22, 2024 | Autonomous Driving, Benchmarking | Code Available | 0 |
| Sketch 'n Solve: An Efficient Python Package for Large-Scale Least Squares Using Randomized Numerical Linear Algebra | Sep 22, 2024 | Benchmarking | —Unverified | 0 |
| Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators | Sep 21, 2024 | Benchmarking | Code Available | 0 |