| My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks | Jun 24, 2023 | Benchmarking, Hate Speech Detection | Unverified | 0 |
| MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models | Jun 23, 2023 | Benchmarking, Language Modeling | Code Available | 2 |
| Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs | Jun 22, 2023 | Arithmetic Reasoning, Benchmarking | Code Available | 1 |
| OptIForest: Optimal Isolation Forest for Anomaly Detection | Jun 22, 2023 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Benchmarking and Analyzing 3D-aware Image Synthesis with a Modularized Codebase | Jun 21, 2023 | 3D-Aware Image Synthesis, Benchmarking | Code Available | 1 |
| GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection | Jun 21, 2023 | Anomaly Detection, Benchmarking | Code Available | 1 |
| On-orbit model training for satellite imagery with label proportions | Jun 21, 2023 | Benchmarking, Earth Observation | Code Available | 0 |
| On Evaluation of Document Classification using RVL-CDIP | Jun 21, 2023 | Benchmarking, Classification | Unverified | 0 |
| VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution | Jun 21, 2023 | Benchmarking, Retrieval | Code Available | 1 |
| Challenges and Opportunities in Improving Worst-Group Generalization in Presence of Spurious Features | Jun 21, 2023 | Benchmarking, Model Selection | Code Available | 1 |
| Evaluation of Popular XAI Applied to Clinical Prediction Models: Can They be Trusted? | Jun 21, 2023 | Benchmarking, Explainable artificial intelligence | Unverified | 0 |
| A Comprehensive Study on the Robustness of Image Classification and Object Detection in Remote Sensing: Surveying and Benchmarking | Jun 21, 2023 | Adversarial Robustness, Benchmarking | Unverified | 0 |
| IMP-MARL: a Suite of Environments for Large-scale Infrastructure Management Planning via MARL | Jun 20, 2023 | Benchmarking, Management | Code Available | 1 |
| Diverse Community Data for Benchmarking Data Privacy Algorithms | Jun 20, 2023 | Benchmarking | Unverified | 0 |
| Geometric Deep Learning for Structure-Based Drug Design: A Survey | Jun 20, 2023 | Benchmarking, Deep Learning | Code Available | 1 |
| Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction | Jun 20, 2023 | Benchmarking, Document-level Relation Extraction | Code Available | 0 |
| Beyond Normal: On the Evaluation of Mutual Information Estimators | Jun 19, 2023 | Benchmarking, Domain Generalization | Code Available | 1 |
| causalAssembly: Generating Realistic Production Data for Benchmarking Causal Discovery | Jun 19, 2023 | Benchmarking, Causal Discovery | Code Available | 1 |
| OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems | Jun 19, 2023 | Benchmarking, Decoder | Code Available | 2 |
| Benchmarking Robustness of Deep Reinforcement Learning approaches to Online Portfolio Management | Jun 19, 2023 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| Fairness Index Measures to Evaluate Bias in Biometric Recognition | Jun 19, 2023 | Benchmarking, Fairness | Unverified | 0 |
| Using Motif Transitions for Temporal Graph Generation | Jun 19, 2023 | Benchmarking, Graph Generation | Code Available | 0 |
| OpenDataVal: a Unified Benchmark for Data Valuation | Jun 18, 2023 | Benchmarking, Data Valuation | Code Available | 1 |
| Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking | Jun 18, 2023 | Benchmarking, Link Prediction | Code Available | 1 |
| Formal Covariate Benchmarking to Bound Omitted Variable Bias | Jun 18, 2023 | Benchmarking, Sensitivity | Unverified | 0 |
| MA-BBOB: Many-Affine Combinations of BBOB Functions for Evaluating AutoML Approaches in Noiseless Numerical Black-Box Optimization Contexts | Jun 18, 2023 | AutoML, Benchmarking | Unverified | 0 |
| CompanyKG: A Large-Scale Heterogeneous Graph for Company Similarity Quantification | Jun 18, 2023 | Benchmarking, Retrieval | Code Available | 1 |
| Benchmarking Deep Learning Architectures for Urban Vegetation Point Cloud Semantic Segmentation from MLS | Jun 17, 2023 | Benchmarking, Segmentation | Unverified | 0 |
| Framework and Benchmarks for Combinatorial and Mixed-variable Bayesian Optimization | Jun 16, 2023 | Bayesian Optimization, Benchmarking | Unverified | 0 |
| Convolutional and Deep Learning based techniques for Time Series Ordinal Classification | Jun 16, 2023 | Benchmarking, Ordinal Classification | Unverified | 0 |
| LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning | Jun 16, 2023 | Active Learning, Benchmarking | Code Available | 1 |
| ALP: Action-Aware Embodied Learning for Perception | Jun 16, 2023 | Benchmarking, object-detection | Unverified | 0 |
| Acoustic Identification of Ae. aegypti Mosquitoes using Smartphone Apps and Residual Convolutional Neural Networks | Jun 16, 2023 | Benchmarking | Code Available | 0 |
| Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond | Jun 16, 2023 | Benchmarking, Evidence Selection | Code Available | 1 |
| AQuA: A Benchmarking Tool for Label Quality Assessment | Jun 15, 2023 | Benchmarking, Label Error Detection | Code Available | 1 |
| Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials | Jun 15, 2023 | Benchmarking, Computational chemistry | Code Available | 1 |
| DISC: a Dataset for Integrated Sensing and Communication in mmWave Systems | Jun 15, 2023 | Activity Recognition, Benchmarking | Unverified | 0 |
| Large-Scale Quantum Separability Through a Reproducible Machine Learning Lens | Jun 15, 2023 | Benchmarking | Unverified | 0 |
| FFB: A Fair Fairness Benchmark for In-Processing Group Fairness Methods | Jun 15, 2023 | Benchmarking, Fairness | Code Available | 1 |
| PaReprop: Fast Parallelized Reversible Backpropagation | Jun 15, 2023 | Benchmarking | Code Available | 1 |
| DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning | Jun 15, 2023 | Benchmarking, Conversational Question Answering | Unverified | 0 |
| PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs | Jun 15, 2023 | Benchmarking | Code Available | 2 |
| Re-Benchmarking Pool-Based Active Learning for Binary Classification | Jun 15, 2023 | Active Learning, Benchmarking | Code Available | 0 |
| MLonMCU: TinyML Benchmarking with Fast Retargeting | Jun 15, 2023 | Benchmarking | Code Available | 1 |
| Towards Motion Forecasting with Real-World Perception Inputs: Are End-to-End Approaches Competitive? | Jun 15, 2023 | Autonomous Driving, Autonomous Vehicles | Code Available | 1 |
| KoLA: Carefully Benchmarking World Knowledge of Large Language Models | Jun 15, 2023 | Benchmarking, Hallucination | Code Available | 1 |
| One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support | Jun 15, 2023 | Benchmarking, Information Retrieval | Code Available | 0 |
| BED: Bi-Encoder-Based Detectors for Out-of-Distribution Detection | Jun 15, 2023 | Benchmarking, Out-of-Distribution Detection | Code Available | 0 |
| Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion | Jun 15, 2023 | Benchmarking, counterfactual | Unverified | 0 |
| Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models | Jun 15, 2023 | Benchmarking, Question Answering | Code Available | 1 |