| From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | Sep 3, 2024 | Benchmarking | Unverified | 0 |
| A practical generalization metric for deep networks benchmarking | Sep 2, 2024 | Benchmarking, Diversity | Unverified | 0 |
| Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Sep 2, 2024 | Action Detection, Benchmarking | Code Available | 1 |
| Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | Sep 2, 2024 | Benchmarking | Unverified | 0 |
| Revisiting Safe Exploration in Safe Reinforcement learning | Sep 2, 2024 | Benchmarking, reinforcement-learning | Unverified | 0 |
| ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Sep 2, 2024 | Benchmarking, Instruction Following | Code Available | 3 |
| Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | Sep 1, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | Aug 30, 2024 | Benchmarking | Unverified | 0 |
| Understanding the User: An Intent-Based Ranking Dataset | Aug 30, 2024 | Benchmarking, Information Retrieval | Unverified | 0 |
| SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Aug 30, 2024 | Benchmarking, Sentiment Analysis | Code Available | 0 |
| STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models | Aug 29, 2024 | Benchmarking, Image Generation | Code Available | 1 |
| Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Aug 29, 2024 | Benchmarking, Diversity | Code Available | 0 |
| How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Aug 29, 2024 | Benchmarking, General Knowledge | Code Available | 1 |
| Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | Aug 29, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Aug 28, 2024 | Benchmarking | Code Available | 2 |
| Benchmarking foundation models as feature extractors for weakly-supervised computational pathology | Aug 28, 2024 | Benchmarking, Diversity | Unverified | 0 |
| LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Aug 28, 2024 | Benchmarking, Logical Reasoning | Code Available | 1 |
| Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games | Aug 28, 2024 | Atari Games, Benchmarking | Unverified | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | Aug 27, 2024 | Benchmarking, Large Language Model | Unverified | 0 |
| Applications in CityLearn Gym Environment for Multi-Objective Control Benchmarking in Grid-Interactive Buildings and Districts | Aug 27, 2024 | Benchmarking, Model Predictive Control | Unverified | 0 |
| BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization | Aug 27, 2024 | 3D Object Detection, Benchmarking | Unverified | 0 |
| VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities | Aug 27, 2024 | Benchmarking, Knowledge Graphs | Code Available | 0 |
| Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper | Aug 27, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation | Aug 27, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting | Aug 27, 2024 | Benchmarking, Decoder | Code Available | 0 |