| Conditional diffusions for amortized neural posterior estimation | Oct 24, 2024 | Bayesian Inference, Benchmarking | Code Available | 0 |
| Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework | Oct 24, 2024 | Benchmarking, Diversity | Code Available | 0 |
| From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems | Oct 24, 2024 | Benchmarking, Common Sense Reasoning | Unverified | 0 |
| FuzzWiz -- Fuzzing Framework for Efficient Hardware Coverage | Oct 23, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Foundation Models on Exceptional Cases: Dataset Creation and Validation | Oct 23, 2024 | Articles, Benchmarking | Code Available | 0 |
| Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling | Oct 23, 2024 | Benchmarking | Unverified | 0 |
| Safe Load Balancing in Software-Defined-Networking | Oct 22, 2024 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies | Oct 22, 2024 | Benchmarking, continuous-control | Unverified | 0 |
| Polyp-E: Benchmarking the Robustness of Deep Segmentation Models via Polyp Editing | Oct 22, 2024 | Attribute, Benchmarking | Unverified | 0 |
| ISImed: A Framework for Self-Supervised Learning using Intrinsic Spatial Information in Medical Images | Oct 22, 2024 | Benchmarking, Self-Supervised Learning | Code Available | 0 |
| Benchmarking Large Language Models for Image Classification of Marine Mammals | Oct 22, 2024 | Benchmarking, image-classification | Code Available | 0 |
| Building Conformal Prediction Intervals with Approximate Message Passing | Oct 21, 2024 | Benchmarking, Conformal Prediction | Code Available | 0 |
| Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios | Oct 21, 2024 | Benchmarking, Few-Shot Learning | Code Available | 0 |
| Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| A Framework for Evaluating Predictive Models Using Synthetic Image Covariates and Longitudinal Data | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping | Oct 21, 2024 | Benchmarking | Unverified | 0 |
| Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence | Oct 20, 2024 | Benchmarking | Unverified | 0 |
| FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning | Oct 19, 2024 | Benchmarking, Drug Discovery | Code Available | 0 |
| Advancing Histopathology with Deep Learning Under Data Scarcity: A Decade in Review | Oct 18, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs | Oct 18, 2024 | Benchmarking, Fairness | Unverified | 0 |
| Trust but Verify: Programmatic VLM Evaluation in the Wild | Oct 17, 2024 | Benchmarking, Language Modelling | Unverified | 0 |
| Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs | Oct 17, 2024 | Benchmarking | Code Available | 0 |
| Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p | Oct 17, 2024 | Benchmarking, regression | Code Available | 0 |
| debiaSAE: Benchmarking and Mitigating Vision-Language Model Bias | Oct 17, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models | Oct 17, 2024 | Benchmarking | Code Available | 0 |