| Benchmarking Distribution Shift in Tabular Data with TableShift | Dec 10, 2023 | BenchmarkingBinary Classification | CodeCode Available | 1 |
| STREAMLINE: An Automated Machine Learning Pipeline for Biomedicine Applied to Examine the Utility of Photography-Based Phenotypes for OSA Prediction Across International Sleep Centers | Dec 9, 2023 | AnatomyAutoML | CodeCode Available | 1 |
| Benchmarking and Analysis of Unsupervised Object Segmentation from Real-world Single Images | Dec 8, 2023 | BenchmarkingObject | CodeCode Available | 1 |
| Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym | Dec 6, 2023 | BenchmarkingDecision Making | CodeCode Available | 1 |
| Let the LLMs Talk: Simulating Human-to-Human Conversational QA via Zero-Shot LLM-to-LLM Interactions | Dec 5, 2023 | BenchmarkingConversational Question Answering | CodeCode Available | 1 |
| BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models | Dec 5, 2023 | BenchmarkingVisual Question Answering | CodeCode Available | 1 |
| BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks | Dec 5, 2023 | BenchmarkingMinecraft | CodeCode Available | 1 |
| Controlgym: Large-Scale Control Environments for Benchmarking Reinforcement Learning Algorithms | Nov 30, 2023 | BenchmarkingOpenAI Gym | CodeCode Available | 1 |
| Enhancing Ligand Pose Sampling for Molecular Docking | Nov 30, 2023 | BenchmarkingMolecular Docking | CodeCode Available | 1 |
| Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation | Nov 30, 2023 | Benchmarkingcounterfactual | CodeCode Available | 1 |