| An Analysis of Model Robustness across Concurrent Distribution Shifts | Jan 8, 2025 | Benchmarking | —Unverified | 0 |
| Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates | May 28, 2025 | Benchmarking, Diversity | —Unverified | 0 |
| Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets | Apr 28, 2025 | Articles, Benchmarking | —Unverified | 0 |
| Benchmarking a (μ+λ) Genetic Algorithm with Configurable Crossover Probability | Jun 10, 2020 | Benchmarking | —Unverified | 0 |
| Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | May 18, 2025 | Benchmarking, Scene Understanding | —Unverified | 0 |
| Can Language Models Serve as Text-Based World Simulators? | Jun 10, 2024 | Benchmarking, Decision Making | —Unverified | 0 |
| Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation | Jun 6, 2024 | Benchmarking, Drug Discovery | —Unverified | 0 |
| Evaluating Nuanced Bias in Large Language Model Free Response Answers | Jul 11, 2024 | Benchmarking, Language Modeling | —Unverified | 0 |
| Benchmarking Algorithms from Machine Learning for Low-Budget Black-Box Optimization | Sep 29, 2021 | Bayesian Optimization, Benchmarking | —Unverified | 0 |
| Can humans help BERT gain "confidence"? | Aug 31, 2023 | Benchmarking, EEG | —Unverified | 0 |