| Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection | Jul 28, 2024 | Benchmarking, Fake News Detection | Unverified | 0 |
| On the Evaluation Consistency of Attribution-based Explanations | Jul 28, 2024 | Benchmarking | Code Available | 0 |
| OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | Jul 26, 2024 | Benchmarking, Document AI | Code Available | 1 |
| Benchmarking Dependence Measures to Prevent Shortcut Learning in Medical Imaging | Jul 26, 2024 | Benchmarking | Code Available | 0 |
| VoxSim: A perceptual voice similarity dataset | Jul 26, 2024 | Benchmarking, Speaker Recognition | Code Available | 1 |
| Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems | Jul 26, 2024 | Benchmarking | Unverified | 0 |
| AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | Jul 26, 2024 | Benchmarking, Code Generation | Code Available | 3 |
| ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Jul 26, 2024 | Benchmarking, Model Selection | Code Available | 1 |
| SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images | Jul 25, 2024 | Benchmarking | Unverified | 0 |
| GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy | Jul 25, 2024 | Benchmarking | Unverified | 0 |