| VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning | Oct 30, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| PC-Gym: Benchmark Environments For Process Control Problems | Oct 29, 2024 | Benchmarking, Chemical Process | Code Available | 2 |
| Image2Struct: Benchmarking Structure Extraction for Vision-Language Models | Oct 29, 2024 | Benchmarking | Unverified | 0 |
| SS3DM: Benchmarking Street-View Surface Reconstruction with a Synthetic 3D Mesh Dataset | Oct 29, 2024 | 3D Reconstruction, Autonomous Driving | Unverified | 0 |
| AI Cyber Risk Benchmark: Automated Exploitation Capabilities | Oct 29, 2024 | Benchmarking, Vulnerability Detection | Unverified | 0 |
| Benchmarking LLM Guardrails in Handling Multilingual Toxicity | Oct 29, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking Human and Automated Prompting in the Segment Anything Model | Oct 29, 2024 | Benchmarking, Image Segmentation | Code Available | 0 |
| Exploring Capabilities of Time Series Foundation Models in Building Analytics | Oct 28, 2024 | Benchmarking, energy management | Unverified | 0 |
| Project MPG: towards a generalized performance benchmark for LLM capabilities | Oct 28, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment | Oct 28, 2024 | Benchmarking, Language Modeling | Code Available | 1 |