| BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | Mar 6, 2024 | Automated Theorem Proving, Benchmarking | Unverified | 0 |
| Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem | Mar 6, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Mar 5, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation | Mar 5, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering | Mar 5, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis | Mar 4, 2024 | Benchmarking, Drug Discovery | Code Available | 2 |
| Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground | Mar 4, 2024 | Benchmarking | Unverified | 0 |
| Classification of the Fashion-MNIST Dataset on a Quantum Computer | Mar 4, 2024 | Benchmarking, Quantum Machine Learning | Unverified | 0 |
| Model Lakes | Mar 4, 2024 | Benchmarking, Management | Unverified | 0 |
| Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks | Mar 4, 2024 | Benchmarking | Code Available | 0 |