| Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | Oct 31, 2023 | Benchmarking | Unverified | 0 |
| A Metadata-Driven Approach to Understand Graph Neural Networks | Oct 30, 2023 | Benchmarking, Graph Learning | Unverified | 0 |
| Domain Generalization in Computational Pathology: Survey and Guidelines | Oct 30, 2023 | Benchmarking, Diagnostic | Unverified | 0 |
| LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection | Oct 29, 2023 | Benchmarking, Diversity | Unverified | 0 |
| Evaluating LLP Methods: Challenges and Approaches | Oct 29, 2023 | Benchmarking, Model Selection | Code Available | 0 |
| Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness | Oct 28, 2023 | Benchmarking, image-classification | Code Available | 0 |
| On General Language Understanding | Oct 27, 2023 | Benchmarking, Ethics | Unverified | 0 |
| OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Oct 27, 2023 | Benchmarking, GPU | Code Available | 0 |
| OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | Oct 26, 2023 | Anomaly Detection, Benchmarking | Unverified | 0 |
| RDBench: ML Benchmark for Relational Databases | Oct 25, 2023 | Benchmarking | Unverified | 0 |
| ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair | Oct 25, 2023 | Benchmarking, Fault localization | Unverified | 0 |
| XFEVER: Exploring Fact Verification across Languages | Oct 25, 2023 | Benchmarking, Fact Verification | Code Available | 0 |
| Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting | Oct 25, 2023 | Benchmarking, Hyperparameter Optimization | Unverified | 0 |
| BLESS: Benchmarking Large Language Models on Sentence Simplification | Oct 24, 2023 | Benchmarking, Diversity | Code Available | 0 |
| Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | Oct 23, 2023 | Benchmarking, Instruction Following | Unverified | 0 |
| XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Oct 23, 2023 | Benchmarking, Time Series | Code Available | 0 |
| DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Oct 23, 2023 | Benchmarking, Image Generation | Code Available | 0 |
| A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | Oct 22, 2023 | 3D Reconstruction, Anatomy | Unverified | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Oct 20, 2023 | Benchmarking, Diversity | Code Available | 0 |
| Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Oct 20, 2023 | Activity Prediction, Benchmarking | Code Available | 0 |
| Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package | Oct 20, 2023 | Benchmarking | Unverified | 0 |
| Almost Equivariance via Lie Algebra Convolutions | Oct 19, 2023 | Benchmarking | Unverified | 0 |
| Benchmarking GPUs on SVBRDF Extractor Model | Oct 19, 2023 | Benchmarking, GPU | Unverified | 0 |
| InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Oct 18, 2023 | Benchmarking, Visual Grounding | Code Available | 0 |
| Alexpaca: Learning Factual Clarification Question Generation Without Examples | Oct 17, 2023 | Benchmarking, Chatbot | Unverified | 0 |
| BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali | Oct 16, 2023 | Benchmarking, Data Augmentation | Unverified | 0 |
| A Novel Benchmarking Paradigm and a Scale- and Motion-Aware Model for Egocentric Pedestrian Trajectory Prediction | Oct 16, 2023 | Benchmarking, Pedestrian Trajectory Prediction | Unverified | 0 |
| An Empirical Study of Super-resolution on Low-resolution Micro-expression Recognition | Oct 16, 2023 | Benchmarking, Micro Expression Recognition | Unverified | 0 |
| TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models | Oct 16, 2023 | Automated Theorem Proving, Benchmarking | Code Available | 0 |
| Assessing Encoder-Decoder Architectures for Robust Coronary Artery Segmentation | Oct 16, 2023 | Benchmarking, Coronary Artery Segmentation | Unverified | 0 |
| Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning | Oct 15, 2023 | Benchmarking, Spatial Reasoning | Unverified | 0 |
| Prompting Scientific Names for Zero-Shot Species Recognition | Oct 15, 2023 | Benchmarking, Zero-Shot Learning | Unverified | 0 |
| Benchmarking the Sim-to-Real Gap in Cloth Manipulation | Oct 14, 2023 | Benchmarking, MuJoCo | Unverified | 0 |
| Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems | Oct 14, 2023 | Benchmarking | Code Available | 0 |
| Mirage: Model-Agnostic Graph Distillation for Graph Classification | Oct 14, 2023 | Benchmarking, Classification | Code Available | 0 |
| BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts | Oct 13, 2023 | Benchmarking, Sentiment Analysis | Code Available | 0 |
| A Benchmarking Protocol for SAR Colorization: From Regression to Deep Learning Approaches | Oct 12, 2023 | Benchmarking, Colorization | Unverified | 0 |
| Who Said That? Benchmarking Social Media AI Detection | Oct 12, 2023 | Benchmarking, Misinformation | Unverified | 0 |
| Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images | Oct 12, 2023 | Benchmarking, Decoder | Unverified | 0 |
| Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms | Oct 11, 2023 | Benchmarking, Denoising | Unverified | 0 |
| Deep Reinforcement Learning for Autonomous Cyber Defence: A Survey | Oct 11, 2023 | Benchmarking, Deep Reinforcement Learning | Unverified | 0 |
| Risk Aware Benchmarking of Large Language Models | Oct 11, 2023 | Benchmarking, Econometrics | Unverified | 0 |
| Transformers for Green Semantic Communication: Less Energy, More Semantics | Oct 11, 2023 | Benchmarking, CPU | Code Available | 0 |
| FedSym: Unleashing the Power of Entropy for Benchmarking the Algorithms for Federated Learning | Oct 11, 2023 | Benchmarking, Diversity | Unverified | 0 |
| Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design | Oct 11, 2023 | Benchmarking, Representation Learning | Unverified | 0 |
| BeSt-LeS: Benchmarking Stroke Lesion Segmentation using Deep Supervision | Oct 10, 2023 | Acute Stroke Lesion Segmentation, Benchmarking | Code Available | 0 |
| On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets | Oct 10, 2023 | All, Benchmarking | Unverified | 0 |
| CAFA-evaluator: A Python Tool for Benchmarking Ontological Classification Methods | Oct 10, 2023 | Benchmarking, Prediction | Unverified | 0 |
| Distributed Evolution Strategies with Multi-Level Learning for Large-Scale Black-Box Optimization | Oct 9, 2023 | Benchmarking | Unverified | 0 |