| Title | Date | Tags | Code |
| --- | --- | --- | --- |
| ACCORD: Closing the Commonsense Measurability Gap | Jun 4, 2024 | Benchmarking, Common Sense Reasoning | Code Available |
| TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Jun 4, 2024 | Benchmarking, Language Modeling | Code Available |
| LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | Jun 3, 2024 | Autonomous Driving, Benchmarking | Unverified |
| ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | Jun 3, 2024 | Action Recognition, Benchmarking | Unverified |
| R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | Jun 3, 2024 | Benchmarking, Code Completion | Unverified |
| Scaffold Splits Overestimate Virtual Screening Performance | Jun 2, 2024 | Benchmarking, Clustering | Unverified |
| WebSuite: Systematically Evaluating Why Web Agents Fail | Jun 1, 2024 | Benchmarking, Diagnostic | Code Available |
| On the project risk baseline: integrating aleatory uncertainty into project scheduling | May 31, 2024 | Benchmarking, Scheduling | Unverified |
| Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | May 30, 2024 | All, Benchmarking | Unverified |
| CoSy: Evaluating Textual Explanations of Neurons | May 30, 2024 | Benchmarking | Unverified |