| From Grounding to Planning: Benchmarking Bottlenecks in Web Agents | Sep 3, 2024 | Benchmarking | Unverified | 0 |
| A practical generalization metric for deep networks benchmarking | Sep 2, 2024 | Benchmarking, Diversity | Unverified | 0 |
| Revisiting Safe Exploration in Safe Reinforcement learning | Sep 2, 2024 | Benchmarking, reinforcement-learning | Unverified | 0 |
| Landscape-Aware Automated Algorithm Configuration using Multi-output Mixed Regression and Classification | Sep 2, 2024 | Benchmarking | Unverified | 0 |
| Towards Student Actions in Classroom Scenes: New Dataset and Baseline | Sep 2, 2024 | Action Detection, Benchmarking | Code Available | 1 |
| ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems | Sep 2, 2024 | Benchmarking, Instruction Following | Code Available | 3 |
| Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages | Sep 1, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Accelerating the discovery of steady-states of planetary interior dynamics with machine learning | Aug 30, 2024 | Benchmarking | Unverified | 0 |
| SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists | Aug 30, 2024 | Benchmarking, Sentiment Analysis | Code Available | 0 |
| Understanding the User: An Intent-Based Ranking Dataset | Aug 30, 2024 | Benchmarking, Information Retrieval | Unverified | 0 |
| STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models | Aug 29, 2024 | Benchmarking, Image Generation | Code Available | 1 |
| Illuminating the Diversity-Fitness Trade-Off in Black-Box Optimization | Aug 29, 2024 | Benchmarking, Diversity | Code Available | 0 |
| How Well Do LLMs Handle Cantonese? Benchmarking Cantonese Capabilities of Large Language Models | Aug 29, 2024 | Benchmarking, General Knowledge | Code Available | 1 |
| Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | Aug 29, 2024 | Automatic Speech Recognition, Automatic Speech Recognition (ASR) | Unverified | 0 |
| Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions | Aug 28, 2024 | Benchmarking | Code Available | 2 |
| Benchmarking foundation models as feature extractors for weakly-supervised computational pathology | Aug 28, 2024 | Benchmarking, Diversity | Unverified | 0 |
| LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models | Aug 28, 2024 | Benchmarking, Logical Reasoning | Code Available | 1 |
| Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games | Aug 28, 2024 | Atari Games, Benchmarking | Unverified | 0 |
| Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis | Aug 27, 2024 | Benchmarking, Large Language Model | Unverified | 0 |
| Applications in CityLearn Gym Environment for Multi-Objective Control Benchmarking in Grid-Interactive Buildings and Districts | Aug 27, 2024 | Benchmarking, Model Predictive Control | Unverified | 0 |
| FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting | Aug 27, 2024 | Benchmarking, Decoder | Code Available | 0 |
| Cross-subject Brain Functional Connectivity Analysis for Multi-task Cognitive State Evaluation | Aug 27, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization | Aug 27, 2024 | 3D Object Detection, Benchmarking | Unverified | 0 |
| Benchmarking Reinforcement Learning Methods for Dexterous Robotic Manipulation with a Three-Fingered Gripper | Aug 27, 2024 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| VHAKG: A Multi-modal Knowledge Graph Based on Synchronized Multi-view Videos of Daily Activities | Aug 27, 2024 | Benchmarking, Knowledge Graphs | Code Available | 0 |
| Comparative Analysis: Violence Recognition from Videos using Transfer Learning | Aug 26, 2024 | Action Recognition, Benchmarking | Code Available | 0 |
| Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study | Aug 26, 2024 | 8k, Benchmarking | Unverified | 0 |
| K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences | Aug 26, 2024 | Benchmarking | Unverified | 0 |
| DHP Benchmark: Are LLMs Good NLG Evaluators? | Aug 25, 2024 | Benchmarking, nlg evaluation | Unverified | 0 |
| Data Augmentation for Continual RL via Adversarial Gradient Episodic Memory | Aug 24, 2024 | Benchmarking, Data Augmentation | Unverified | 0 |
| Variational Autoencoder for Anomaly Detection: A Comparative Study | Aug 24, 2024 | Anomaly Detection, Benchmarking | Code Available | 1 |
| No Dataset Needed for Downstream Knowledge Benchmarking: Response Dispersion Inversely Correlates with Accuracy on Domain-specific QA | Aug 24, 2024 | Benchmarking, Chatbot | Unverified | 0 |
| Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection | Aug 23, 2024 | Benchmarking, Binary Classification | Unverified | 0 |
| S3Simulator: A benchmarking Side Scan Sonar Simulator dataset for Underwater Image Analysis | Aug 23, 2024 | Benchmarking | Code Available | 0 |
| Open Llama2 Model for the Lithuanian Language | Aug 23, 2024 | Benchmarking, model | Unverified | 0 |
| Benchmarking Counterfactual Interpretability in Deep Learning Models for Time Series Classification | Aug 22, 2024 | Benchmarking, counterfactual | Unverified | 0 |
| MultiMed: Massively Multimodal and Multitask Medical Understanding | Aug 22, 2024 | Benchmarking, Medical Question Answering | Unverified | 0 |
| Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis | Aug 22, 2024 | Benchmarking | Unverified | 0 |
| Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets | Aug 22, 2024 | All, Benchmarking | Code Available | 1 |
| Dynamic PDB: A New Dataset and a SE(3) Model Extension by Integrating Dynamic Behaviors and Physical Properties in Protein Structures | Aug 22, 2024 | Benchmarking, Trajectory Prediction | Unverified | 0 |
| WCEbleedGen: A wireless capsule endoscopy dataset and its benchmarking for automatic bleeding classification, detection, and segmentation | Aug 22, 2024 | Benchmarking, Classification | Code Available | 0 |
| Advances in Preference-based Reinforcement Learning: A Review | Aug 21, 2024 | Benchmarking, reinforcement-learning | Unverified | 0 |
| SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins | Aug 21, 2024 | Benchmarking | Code Available | 0 |
| WeQA: A Benchmark for Retrieval Augmented Generation in Wind Energy Domain | Aug 21, 2024 | Answer Generation, Benchmarking | Unverified | 0 |
| ISLES'24: Improving final infarct prediction in ischemic stroke using multimodal imaging and clinical data | Aug 20, 2024 | Benchmarking | Unverified | 0 |
| UKAN: Unbound Kolmogorov-Arnold Network Accompanied with Accelerated Library | Aug 20, 2024 | Benchmarking, Computational Efficiency | Unverified | 0 |
| Benchmarking Large Language Models for Math Reasoning Tasks | Aug 20, 2024 | Benchmarking, In-Context Learning | Code Available | 0 |
| PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis | Aug 20, 2024 | Benchmarking | Code Available | 2 |
| RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands | Aug 20, 2024 | Benchmarking, Contact-rich Manipulation | Unverified | 0 |
| QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning | Aug 20, 2024 | Benchmarking, Language Modelling | Unverified | 0 |