| Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | Oct 11, 2024 | Benchmarking, Sentence | Unverified | 0 |
| Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | Oct 11, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | Oct 11, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | Oct 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Enterprise Benchmarks for Large Language Model Evaluation | Oct 11, 2024 | Benchmarking, Language Model Evaluation | Code Available | 0 |
| A Comparative Analysis on Ethical Benchmarking in Large Language Models | Oct 11, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Identifying Money Laundering Subgraphs on the Blockchain | Oct 10, 2024 | Benchmarking | Code Available | 0 |
| Audio Explanation Synthesis with Generative Foundation Models | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Advocating Character Error Rate for Multilingual ASR Evaluation | Oct 9, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |