| When Graph meets Multimodal: Benchmarking on Multimodal Attributed Graphs Learning | Oct 11, 2024 | Attribute, Benchmarking | Code Available | 1 |
| Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example | Oct 11, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Can we hop in general? A discussion of benchmark selection and design using the Hopper environment | Oct 11, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| ∀uto∃∨∧L: Autonomous Evaluation of LLMs for Truth Maintenance and Reasoning Tasks | Oct 11, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Guidelines for Fine-grained Sentence-level Arabic Readability Annotation | Oct 11, 2024 | Benchmarking, Sentence | Unverified | 0 |
| Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation | Oct 11, 2024 | Benchmarking, Image Segmentation | Code Available | 1 |
| TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations | Oct 10, 2024 | Benchmarking, Decision Making | Code Available | 0 |
| Identifying Money Laundering Subgraphs on the Blockchain | Oct 10, 2024 | Benchmarking | Code Available | 0 |
| COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act | Oct 10, 2024 | Benchmarking, Fairness | Code Available | 2 |
| Benchmarking Agentic Workflow Generation | Oct 10, 2024 | Benchmarking | Code Available | 2 |