| Alexpaca: Learning Factual Clarification Question Generation Without Examples | Oct 17, 2023 | Benchmarking, Chatbot | Unverified | 0 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Oct 17, 2023 | Benchmarking, Language Modelling | Code Available | 1 |
| DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations | Oct 17, 2023 | Benchmarking, Emotion Recognition | Code Available | 1 |
| BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | Oct 16, 2023 | Benchmarking, Data Augmentation | Unverified | 0 |
| An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition | Oct 16, 2023 | Benchmarking, Micro Expression Recognition | Unverified | 0 |
| Assessing Encoder-Decoder Architectures for Robust Coronary Artery Segmentation | Oct 16, 2023 | Benchmarking, Coronary Artery Segmentation | Unverified | 0 |
| 3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding | Oct 16, 2023 | Action Recognition, Benchmarking | Code Available | 1 |
| TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models | Oct 16, 2023 | Automated Theorem Proving, Benchmarking | Code Available | 0 |
| A Novel Benchmarking Paradigm and a Scale- and Motion-Aware Model for Egocentric Pedestrian Trajectory Prediction | Oct 16, 2023 | Benchmarking, Pedestrian Trajectory Prediction | Unverified | 0 |
| Prompting Scientific Names for Zero-Shot Species Recognition | Oct 15, 2023 | Benchmarking, Zero-Shot Learning | Unverified | 0 |
| Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning | Oct 15, 2023 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems | Oct 14, 2023 | Benchmarking | Code Available | 0 |
| Benchmarking the Sim-to-Real Gap in Cloth Manipulation | Oct 14, 2023 | Benchmarking, MuJoCo | Unverified | 0 |
| Mirage: Model-Agnostic Graph Distillation for Graph Classification | Oct 14, 2023 | Benchmarking, Classification | Code Available | 0 |
| "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters | Oct 13, 2023 | Benchmarking, Fairness | Code Available | 1 |
| pose-format: Library for Viewing, Augmenting, and Handling .pose Files | Oct 13, 2023 | Benchmarking, Management | Code Available | 1 |
| BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts | Oct 13, 2023 | Benchmarking, Sentiment Analysis | Code Available | 0 |
| Welfare Diplomacy: Benchmarking Language Model Cooperation | Oct 13, 2023 | Benchmarking, Language Modeling | Code Available | 1 |
| MetaBox: A Benchmark Platform for Meta-Black-Box Optimization with Reinforcement Learning | Oct 12, 2023 | Benchmarking | Code Available | 1 |
| GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts | Oct 12, 2023 | Benchmarking | Code Available | 1 |
| A Benchmarking Protocol for SAR Colorization: From Regression to Deep Learning Approaches | Oct 12, 2023 | Benchmarking, Colorization | Unverified | 0 |
| Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images | Oct 12, 2023 | Benchmarking, Decoder | Unverified | 0 |
| Who Said That? Benchmarking Social Media AI Detection | Oct 12, 2023 | Benchmarking, Misinformation | Unverified | 0 |
| Towards Evaluating Generalist Agents: An Automated Benchmark in Open World | Oct 12, 2023 | Benchmarking, Diversity | Code Available | 1 |
| Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Oct 12, 2023 | Benchmarking, Code Generation | Code Available | 2 |
| CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving | Oct 11, 2023 | Autonomous Driving, Benchmarking | Code Available | 3 |
| Deep Reinforcement Learning for Autonomous Cyber Defence: A Survey | Oct 11, 2023 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | Oct 11, 2023 | Benchmarking, Diversity | Unverified | 0 |
| Transformers for Green Semantic Communication: Less Energy, More Semantics | Oct 11, 2023 | Benchmarking, CPU | Code Available | 0 |
| Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design | Oct 11, 2023 | Benchmarking, Representation Learning | Unverified | 0 |
| Risk Aware Benchmarking of Large Language Models | Oct 11, 2023 | Benchmarking, Econometrics | Unverified | 0 |
| Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms | Oct 11, 2023 | Benchmarking, Denoising | Unverified | 0 |
| ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons | Oct 11, 2023 | Benchmarking, Position | Code Available | 2 |
| BeSt-LeS: Benchmarking Stroke Lesion Segmentation using Deep Supervision | Oct 10, 2023 | Acute Stroke Lesion Segmentation, Benchmarking | Code Available | 0 |
| CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods | Oct 10, 2023 | Benchmarking, Prediction | Unverified | 0 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Oct 10, 2023 | Benchmarking, Code Generation | Code Available | 1 |
| Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach | Oct 10, 2023 | Benchmarking, Code Generation | Code Available | 1 |
| On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets | Oct 10, 2023 | All, Benchmarking | Unverified | 0 |
| Distributed Evolution Strategies with Multi-Level Learning for Large-Scale Black-Box Optimization | Oct 9, 2023 | Benchmarking | Unverified | 0 |
| Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis | Oct 9, 2023 | Benchmarking, Multivariate Time Series Forecasting | Code Available | 3 |
| Transcending the Attention Paradigm: Representation Learning from Geospatial Social Media Data | Oct 9, 2023 | Benchmarking, Language Modeling | Code Available | 0 |
| Simple GNNs with Low Rank Non-parametric Aggregators | Oct 8, 2023 | Benchmarking, Node Classification | Code Available | 0 |
| Hi Guys or Hi Folks? Benchmarking Gender-Neutral Machine Translation with the GeNTE Corpus | Oct 8, 2023 | Benchmarking, Machine Translation | Code Available | 0 |
| Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems | Oct 8, 2023 | Benchmarking | Code Available | 0 |
| Benchmarking Large Language Models with Augmented Instructions for Fine-grained Information Extraction | Oct 8, 2023 | Benchmarking, Decoder | Unverified | 0 |
| FinGPT: Instruction Tuning Benchmark for Open-Source Large Language Models in Financial Datasets | Oct 7, 2023 | Benchmarking, named-entity-recognition | Unverified | 0 |
| Beyond Text: A Deep Dive into Large Language Models' Ability on Understanding Graph Data | Oct 7, 2023 | Benchmarking | Unverified | 0 |
| AKFruitYield: Modular benchmarking and video analysis software for Azure Kinect cameras for fruit size and fruit yield estimation in apple orchards | Oct 6, 2023 | Benchmarking | Code Available | 0 |
| Full-scale modal testing of a Hawk T1A aircraft for benchmarking vibration-based methods | Oct 6, 2023 | Benchmarking, Experimental Design | Unverified | 0 |
| LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation | Oct 6, 2023 | Benchmarking, Mathematical Reasoning | Unverified | 0 |