| Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization | Nov 15, 2023 | Benchmarking, Instruction Following | Code Available | 1 |
| MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration | Nov 14, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Combinatorial Optimization with Policy Adaptation using Latent Space Search | Nov 13, 2023 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime | Nov 13, 2023 | Benchmarking, Combinatorial Optimization | Code Available | 1 |
| WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models | Nov 13, 2023 | Benchmarking, Instruction Following | Code Available | 1 |
| Flames: Benchmarking Value Alignment of LLMs in Chinese | Nov 12, 2023 | Benchmarking, Fairness | Code Available | 1 |
| MultiIoT: Benchmarking Machine Learning for the Internet of Things | Nov 10, 2023 | Benchmarking, Representation Learning | Code Available | 1 |
| CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation | Nov 10, 2023 | Benchmarking, Cloud Computing | Code Available | 1 |
| TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs | Nov 9, 2023 | Benchmarking, Question Answering | Code Available | 1 |
| The PetShop Dataset -- Finding Causes of Performance Issues across Microservices | Nov 8, 2023 | Benchmarking | Code Available | 1 |
| The voraus-AD Dataset for Anomaly Detection in Robot Applications | Nov 8, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 |
| Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts | Nov 7, 2023 | Benchmarking, Machine Translation | Code Available | 1 |
| Benchmarking Geospatial Question Answering Engines using the Dataset GeoQuestions1089 | Nov 6, 2023 | Benchmarking, Knowledge Base Question Answering | Code Available | 1 |
| Hopfield-Enhanced Deep Neural Networks for Artifact-Resilient Brain State Decoding | Nov 6, 2023 | Benchmarking, Data Compression | Code Available | 1 |
| JRDB-Traj: A Dataset and Benchmark for Trajectory Forecasting in Crowds | Nov 5, 2023 | Autonomous Navigation, Autonomous Vehicles | Code Available | 1 |
| Digital Typhoon: Long-term Satellite Image Dataset for the Spatio-Temporal Modeling of Tropical Cyclones | Nov 5, 2023 | Benchmarking | Code Available | 1 |
| NeuroEvoBench: Benchmarking Evolutionary Optimizers for Deep Learning Applications | Nov 4, 2023 | Benchmarking, Deep Learning | Code Available | 1 |
| FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation | Nov 4, 2023 | Benchmarking, Drug Discovery | Code Available | 1 |
| Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO | Nov 2, 2023 | Benchmarking, Edge-computing | Code Available | 1 |
| EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Nov 1, 2023 | Benchmarking, Cryogenic Electron Microscopy (cryo-EM) | Code Available | 1 |
| In Search of Lost Online Test-time Adaptation: A Survey | Oct 31, 2023 | Benchmarking, GPU | Code Available | 1 |
| Re-evaluating Retrosynthesis Algorithms with Syntheseus | Oct 30, 2023 | Benchmarking, Multi-step retrosynthesis | Code Available | 1 |
| MLFMF: Data Sets for Machine Learning for Mathematical Formalization | Oct 24, 2023 | Benchmarking, Recommendation Systems | Code Available | 1 |
| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Oct 23, 2023 | Benchmarking | Code Available | 1 |
| MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Oct 20, 2023 | Benchmarking, de-en | Code Available | 1 |
| Fast hyperboloid decision tree algorithms | Oct 20, 2023 | Benchmarking, Riemannian optimization | Code Available | 1 |
| OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Oct 19, 2023 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | Oct 18, 2023 | Adversarial Robustness | Code Available | 1 |
| FactCHD: Benchmarking Fact-Conflicting Hallucination Detection | Oct 18, 2023 | Benchmarking, Hallucination | Code Available | 1 |
| Object-aware Inversion and Reassembly for Image Editing | Oct 18, 2023 | Benchmarking, Denoising | Code Available | 1 |
| DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Oct 17, 2023 | Benchmarking, Emotion Recognition | Code Available | 1 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Oct 17, 2023 | Benchmarking, Language Modelling | Code Available | 1 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Oct 16, 2023 | Action Recognition, Benchmarking | Code Available | 1 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | Oct 13, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Oct 13, 2023 | Benchmarking, Management | Code Available | 1 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Oct 13, 2023 | Benchmarking, Fairness | Code Available | 1 |
| Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Oct 12, 2023 | Benchmarking, Diversity | Code Available | 1 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Oct 12, 2023 | Benchmarking | Code Available | 1 |
| MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Oct 12, 2023 | Benchmarking | Code Available | 1 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Oct 10, 2023 | Benchmarking, Code Generation | Code Available | 1 |
| Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Oct 10, 2023 | Benchmarking, Code Generation | Code Available | 1 |
| PepMLM: Target Sequence-Conditioned Generation of Therapeutic Peptide Binders via Span Masked Language Modeling | Oct 5, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| Can Language Models Employ the Socratic Method? Experiments with Code Debugging | Oct 4, 2023 | Benchmarking | Code Available | 1 |
| GNNX-BENCH: Unravelling the Utility of Perturbation-based GNN Explainers through In-depth Benchmarking | Oct 3, 2023 | Benchmarking, counterfactual | Code Available | 1 |
| CausalTime: Realistically Generated Time-series for Benchmarking of Causal Discovery | Oct 3, 2023 | Benchmarking, Causal Discovery | Code Available | 1 |
| PGDQN: Preference-Guided Deep Q-Network | Oct 3, 2023 | Atari Games, Benchmarking | Code Available | 1 |
| Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Oct 2, 2023 | Benchmarking, Safety Alignment | Code Available | 1 |
| NewsRecLib: A PyTorch-Lightning Library for Neural News Recommendation | Oct 2, 2023 | Benchmarking, News Recommendation | Code Available | 1 |
| FELM: Benchmarking Factuality Evaluation of Large Language Models | Oct 1, 2023 | Benchmarking, Math | Code Available | 1 |
| Benchmarking Cognitive Biases in Large Language Models as Evaluators | Sep 29, 2023 | Benchmarking, In-Context Learning | Code Available | 1 |