| Do LLM Evaluators Prefer Themselves for a Reason? | Apr 4, 2025 | Benchmarking, Code Generation | Code Available | 0 | 5 |
| Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning | Jan 22, 2025 | Benchmarking | Code Available | 0 | 5 |
| Benchmarking Large Language Models on Communicative Medical Coaching: a Novel System and Dataset | Feb 8, 2024 | Benchmarking | Code Available | 0 | 5 |
| Flexible Generation of Preference Data for Recommendation Analysis | Jul 23, 2024 | Benchmarking, Recommendation Systems | Code Available | 0 | 5 |
| HATE-ITA: New Baselines for Hate Speech Detection in Italian | Jul 1, 2022 | Benchmarking, Hate Speech Detection | Code Available | 0 | 5 |
| Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Aug 29, 2024 | Benchmarking, Diversity | Code Available | 0 | 5 |
| Evaluating Shallow and Deep Neural Networks for Network Intrusion Detection Systems in Cyber Security | Oct 8, 2018 | Benchmarking, BIG-bench Machine Learning | Code Available | 0 | 5 |
| Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | May 19, 2023 | Benchmarking, Form | Code Available | 0 | 5 |
| Strong and Simple Baselines for Multimodal Utterance Embeddings | May 14, 2019 | Benchmarking | Code Available | 0 | 5 |
| GenCeption: Evaluate Multimodal LLMs with Unlabeled Unimodal Data | Feb 22, 2024 | Benchmarking | Code Available | 0 | 5 |