| Profit: Benchmarking Personalization and Robustness Trade-off in Federated Prompt Tuning | Oct 6, 2023 | Benchmarking, Federated Learning | Unverified | 0 |
| CIFAR-10-Warehouse: Broad and More Realistic Testbeds in Model Generalization Analysis | Oct 6, 2023 | Benchmarking, Domain Generalization | Unverified | 0 |
| Bringing Quantum Algorithms to Automated Machine Learning: A Systematic Review of AutoML Frameworks Regarding Extensibility for QML Algorithms | Oct 6, 2023 | AutoML, Benchmarking | Unverified | 0 |
| A Review of Deep Reinforcement Learning in Serverless Computing: Function Scheduling and Resource Auto-Scaling | Oct 5, 2023 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Oct 5, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Benchmarking a foundation LLM on its ability to re-label structure names in accordance with the AAPM TG-263 report | Oct 5, 2023 | Benchmarking | Unverified | 0 |
| MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation | Oct 5, 2023 | Benchmarking, Decision Making | Code Available | 2 |
| Deep Reinforcement Learning Algorithms for Hybrid V2X Communication: A Benchmarking Study | Oct 4, 2023 | Autonomous Vehicles, Benchmarking | Unverified | 0 |
| Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Oct 4, 2023 | Benchmarking | Code Available | 1 |
| Fully Automatic Segmentation of Gross Target Volume and Organs-at-Risk for Radiotherapy Planning of Nasopharyngeal Carcinoma | Oct 4, 2023 | Benchmarking, Segmentation | Code Available | 0 |
| From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference | Oct 4, 2023 | Benchmarking, GPU | Unverified | 0 |
| On the Performance of Multimodal Language Models | Oct 4, 2023 | Benchmarking, Binary Classification | Unverified | 0 |
| T^3Bench: Benchmarking Current Progress in Text-to-3D Generation | Oct 4, 2023 | 3D Generation, Benchmarking | Code Available | 3 |
| PGDQN: Preference-Guided Deep Q-Network | Oct 3, 2023 | Atari Games, Benchmarking | Code Available | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Oct 3, 2023 | Benchmarking, Causal Discovery | Code Available | 1 |
| EGraFFBench: Evaluation of Equivariant Graph Neural Network Force Fields for Atomistic Simulations | Oct 3, 2023 | Atomic Forces, Benchmarking | Unverified | 0 |
| EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | Oct 3, 2023 | Benchmarking, text-guided-image-editing | Unverified | 0 |
| Benchmarking and Improving Generator-Validator Consistency of Language Models | Oct 3, 2023 | Benchmarking, Instruction Following | Unverified | 0 |
| GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Oct 3, 2023 | Benchmarking, counterfactual | Code Available | 1 |
| Learning Quantum Processes with Quantum Statistical Queries | Oct 3, 2023 | Benchmarking, Cryptanalysis | Code Available | 0 |
| Adaptive Visual Scene Understanding: Incremental Scene Graph Generation | Oct 2, 2023 | Benchmarking, Continual Learning | Code Available | 0 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Oct 2, 2023 | Benchmarking, Safety Alignment | Code Available | 1 |
| A New Real-World Video Dataset for the Comparison of Defogging Algorithms | Oct 2, 2023 | Benchmarking, Deblurring | Unverified | 0 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Oct 2, 2023 | Benchmarking, News Recommendation | Code Available | 1 |
| TRAM: Benchmarking Temporal Reasoning for Large Language Models | Oct 2, 2023 | Benchmarking, Few-Shot Learning | Unverified | 0 |
| CoDBench: A Critical Evaluation of Data-driven Models for Continuous Dynamical Systems | Oct 2, 2023 | Benchmarking, Computational Efficiency | Unverified | 0 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | Benchmarking, Math | Code Available | 1 |
| RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models | Oct 1, 2023 | Benchmarking | Code Available | 2 |
| Adaptive Control of an Inverted Pendulum by a Reinforcement Learning-based LQR Method | Sep 30, 2023 | Benchmarking, Reinforcement Learning (RL) | Unverified | 0 |
| The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks | Sep 30, 2023 | Benchmarking | Unverified | 0 |
| MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data | Sep 29, 2023 | Benchmarking, Contrastive Learning | Code Available | 1 |
| Sarcasm in Sight and Sound: Benchmarking and Expansion to Improve Multimodal Sarcasm Detection | Sep 29, 2023 | Benchmarking, Diversity | Unverified | 0 |
| FedAIoT: A Federated Learning Benchmark for Artificial Intelligence of Things | Sep 29, 2023 | Benchmarking, Federated Learning | Code Available | 1 |
| Optimizing with Low Budgets: a Comparison on the Black-box Optimization Benchmarking Suite and OpenAI Gym | Sep 29, 2023 | Bayesian Optimization, Benchmarking | Unverified | 0 |
| Benchmarking Collaborative Learning Methods Cost-Effectiveness for Prostate Segmentation | Sep 29, 2023 | Benchmarking, Federated Learning | Unverified | 0 |
| Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle? | Sep 29, 2023 | Benchmarking, Knowledge Graph Completion | Code Available | 1 |
| Benchmarking Cognitive Biases in Large Language Models as Evaluators | Sep 29, 2023 | Benchmarking, In-Context Learning | Code Available | 1 |
| Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors | Sep 29, 2023 | Benchmarking, Computational Efficiency | Unverified | 0 |
| A rigorous benchmarking of methods for SARS-CoV-2 lineage abundance estimation in wastewater | Sep 29, 2023 | Benchmarking | Unverified | 0 |
| Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts | Sep 29, 2023 | Benchmarking, Decision Making | Unverified | 0 |
| SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation | Sep 29, 2023 | 3D Human Pose Estimation, 3D Human Reconstruction | Code Available | 3 |
| G4SATBench: Benchmarking and Advancing SAT Solving with Graph Neural Networks | Sep 29, 2023 | Benchmarking | Code Available | 1 |
| FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding | Sep 28, 2023 | Benchmarking, Image Retrieval | Code Available | 1 |
| LagrangeBench: A Lagrangian Fluid Mechanics Benchmarking Suite | Sep 28, 2023 | Benchmarking | Code Available | 1 |
| Revisiting Neural Program Smoothing for Fuzzing | Sep 28, 2023 | Benchmarking, CPU | Code Available | 1 |
| Language Models as a Service: Overview of a New Paradigm and its Challenges | Sep 28, 2023 | Benchmarking | Unverified | 0 |
| LawBench: Benchmarking Legal Knowledge of Large Language Models | Sep 28, 2023 | Articles, Benchmarking | Code Available | 2 |
| GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond | Sep 28, 2023 | Benchmarking | Code Available | 2 |
| The Trickle-down Impact of Reward (In-)consistency on RLHF | Sep 28, 2023 | Benchmarking | Code Available | 1 |
| OceanBench: The Sea Surface Height Edition | Sep 27, 2023 | Benchmarking, Sensor Fusion | Code Available | 1 |