| Fast hyperboloid decision tree algorithms | Oct 20, 2023 | Benchmarking, Riemannian optimization | Code Available | 1 |
| OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Oct 19, 2023 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | Oct 18, 2023 | Adversarial Robustness | Code Available | 1 |
| FactCHD: Benchmarking Fact-Conflicting Hallucination Detection | Oct 18, 2023 | Benchmarking, Hallucination | Code Available | 1 |
| Object-aware Inversion and Reassembly for Image Editing | Oct 18, 2023 | Benchmarking, Denoising | Code Available | 1 |
| DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Oct 17, 2023 | Benchmarking, Emotion Recognition | Code Available | 1 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Oct 17, 2023 | Benchmarking, Language Modelling | Code Available | 1 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Oct 16, 2023 | Action Recognition, Benchmarking | Code Available | 1 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | Oct 13, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Oct 13, 2023 | Benchmarking, Management | Code Available | 1 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Oct 13, 2023 | Benchmarking, Fairness | Code Available | 1 |
| Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Oct 12, 2023 | Benchmarking, Diversity | Code Available | 1 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Oct 12, 2023 | Benchmarking | Code Available | 1 |
| MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Oct 12, 2023 | Benchmarking | Code Available | 1 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Oct 10, 2023 | Benchmarking, Code Generation | Code Available | 1 |
| Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Oct 10, 2023 | Benchmarking, Code Generation | Code Available | 1 |
| PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Oct 5, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Oct 4, 2023 | Benchmarking | Code Available | 1 |
| GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Oct 3, 2023 | Benchmarking, counterfactual | Code Available | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Oct 3, 2023 | Benchmarking, Causal Discovery | Code Available | 1 |
| PGDQN: Preference-Guided Deep Q-Network | Oct 3, 2023 | Atari Games, Benchmarking | Code Available | 1 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Oct 2, 2023 | Benchmarking, Safety Alignment | Code Available | 1 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Oct 2, 2023 | Benchmarking, News Recommendation | Code Available | 1 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | Benchmarking, Math | Code Available | 1 |
| Benchmarking Cognitive Biases in Large Language Models as Evaluators | Sep 29, 2023 | Benchmarking, In-Context Learning | Code Available | 1 |