| From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | Sep 3, 2024 | Benchmarking | Unverified | 0 |
| A practical generalization metric for deep networks benchmarking | Sep 2, 2024 | Benchmarking, Diversity | Unverified | 0 |
| Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | Sep 2, 2024 | Benchmarking | Unverified | 0 |
| Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Sep 2, 2024 | Action Detection, Benchmarking | Code Available | 1 |
| Revisiting Safe Exploration in Safe Reinforcement learning | Sep 2, 2024 | Benchmarking, reinforcement-learning | Unverified | 0 |
| ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Sep 2, 2024 | Benchmarking, Instruction Following | Code Available | 3 |
| Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | Sep 1, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | Aug 30, 2024 | Benchmarking | Unverified | 0 |
| Understanding the User: An Intent-Based Ranking Dataset | Aug 30, 2024 | Benchmarking, Information Retrieval | Unverified | 0 |
| SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Aug 30, 2024 | Benchmarking, Sentiment Analysis | Code Available | 0 |