| Grounded Intuition of GPT-Vision's Abilities with Scientific Images | Nov 3, 2023 | Benchmarking, counterfactual | Code Available | 0 |
| An Empirical Study of Benchmarking Chinese Aspect Sentiment Quad Prediction | Nov 3, 2023 | Benchmarking, Sentence | Unverified | 0 |
| Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval | Nov 3, 2023 | Benchmarking, Fairness | Code Available | 0 |
| Decentralized Federated Learning on the Edge over Wireless Mesh Networks | Nov 2, 2023 | Benchmarking, Federated Learning | Unverified | 0 |
| Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia | Nov 2, 2023 | Benchmarking, Machine Translation | Code Available | 0 |
| Ultra-Efficient On-Device Object Detection on AI-Integrated Smart Glasses with TinyissimoYOLO | Nov 2, 2023 | Benchmarking, Edge-computing | Code Available | 1 |
| EMPOT: partial alignment of density maps and rigid body fitting using unbalanced Gromov-Wasserstein divergence | Nov 1, 2023 | Benchmarking, Cryogenic Electron Microscopy (cryo-EM) | Code Available | 1 |
| Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs | Nov 1, 2023 | Benchmarking, Question Answering | Unverified | 0 |
| SCPO: Safe Reinforcement Learning with Safety Critic Policy Optimization | Nov 1, 2023 | Benchmarking, reinforcement-learning | Unverified | 0 |
| UAV Immersive Video Streaming: A Comprehensive Survey, Benchmarking, and Open Challenges | Oct 31, 2023 | Benchmarking | Unverified | 0 |
| A Two-Step Framework for Multi-Material Decomposition of Dual Energy Computed Tomography from Projection Domain | Oct 31, 2023 | Benchmarking, Diagnostic | Unverified | 0 |
| Next-generation MRD assays: do we have the tools to evaluate them properly? | Oct 31, 2023 | Benchmarking, Sensitivity | Unverified | 0 |
| In Search of Lost Online Test-time Adaptation: A Survey | Oct 31, 2023 | Benchmarking, GPU | Code Available | 1 |
| What's In My Big Data? | Oct 31, 2023 | Benchmarking | Code Available | 2 |
| Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests | Oct 31, 2023 | Benchmarking | Unverified | 0 |
| Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks | Oct 30, 2023 | Benchmarking, object-detection | Code Available | 2 |
| Domain Generalization in Computational Pathology: Survey and Guidelines | Oct 30, 2023 | Benchmarking, Diagnostic | Unverified | 0 |
| A Metadata-Driven Approach to Understand Graph Neural Networks | Oct 30, 2023 | Benchmarking, Graph Learning | Unverified | 0 |
| Re-evaluating Retrosynthesis Algorithms with Syntheseus | Oct 30, 2023 | Benchmarking, Multi-step retrosynthesis | Code Available | 1 |
| LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection | Oct 29, 2023 | Benchmarking, Diversity | Unverified | 0 |
| Evaluating LLP Methods: Challenges and Approaches | Oct 29, 2023 | Benchmarking, Model Selection | Code Available | 0 |
| Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness | Oct 28, 2023 | Benchmarking, image-classification | Code Available | 0 |
| OpenDMC: An Open-Source Library and Performance Evaluation for Deep-learning-based Multi-frame Compression | Oct 27, 2023 | Benchmarking, GPU | Code Available | 0 |
| On General Language Understanding | Oct 27, 2023 | Benchmarking, Ethics | Unverified | 0 |
| OrionBench: Benchmarking Time Series Generative Models in the Service of the End-User | Oct 26, 2023 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting | Oct 25, 2023 | Benchmarking, Hyperparameter Optimization | Unverified | 0 |
| RDBench: ML Benchmark for Relational Databases | Oct 25, 2023 | Benchmarking | Unverified | 0 |
| ConDefects: A New Dataset to Address the Data Leakage Concern for LLM-based Fault Localization and Program Repair | Oct 25, 2023 | Benchmarking, Fault localization | Unverified | 0 |
| XFEVER: Exploring Fact Verification across Languages | Oct 25, 2023 | Benchmarking, Fact Verification | Code Available | 0 |
| MLFMF: Data Sets for Machine Learning for Mathematical Formalization | Oct 24, 2023 | Benchmarking, Recommendation Systems | Code Available | 1 |
| BLESS: Benchmarking Large Language Models on Sentence Simplification | Oct 24, 2023 | Benchmarking, Diversity | Code Available | 0 |
| CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks | Oct 23, 2023 | Benchmarking | Code Available | 1 |
| Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic | Oct 23, 2023 | Benchmarking, Instruction Following | Unverified | 0 |
| DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design | Oct 23, 2023 | Benchmarking, Image Generation | Code Available | 0 |
| XTSC-Bench: Quantitative Benchmarking for Explainers on Time Series Classification | Oct 23, 2023 | Benchmarking, Time Series | Code Available | 0 |
| A Quantitative Evaluation of Dense 3D Reconstruction of Sinus Anatomy from Monocular Endoscopic Video | Oct 22, 2023 | 3D Reconstruction, Anatomy | Unverified | 0 |
| MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation | Oct 21, 2023 | Benchmarking, Language Model Evaluation | Unverified | 0 |
| Fast hyperboloid decision tree algorithms | Oct 20, 2023 | Benchmarking, Riemannian optimization | Code Available | 1 |
| Benchmarking and Improving Text-to-SQL Generation under Ambiguity | Oct 20, 2023 | Benchmarking, Diversity | Code Available | 0 |
| Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models | Oct 20, 2023 | Activity Prediction, Benchmarking | Code Available | 0 |
| MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark | Oct 20, 2023 | Benchmarking, de-en | Code Available | 1 |
| Standardised workflow for mass spectrometry-based single-cell proteomics data processing and analysis using the scp package | Oct 20, 2023 | Benchmarking | Unverified | 0 |
| Benchmarking GPUs on SVBRDF Extractor Model | Oct 19, 2023 | Benchmarking, GPU | Unverified | 0 |
| Almost Equivariance via Lie Algebra Convolutions | Oct 19, 2023 | Benchmarking | Unverified | 0 |
| OODRobustBench: a Benchmark and Large-Scale Analysis of Adversarial Robustness under Distribution Shift | Oct 19, 2023 | Adversarial Robustness, Benchmarking | Code Available | 1 |
| Formalizing and Benchmarking Prompt Injection Attacks and Defenses | Oct 19, 2023 | Benchmarking | Code Available | 2 |
| FactCHD: Benchmarking Fact-Conflicting Hallucination Detection | Oct 18, 2023 | Benchmarking, Hallucination | Code Available | 1 |
| InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions | Oct 18, 2023 | Benchmarking, Visual Grounding | Code Available | 0 |
| To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now | Oct 18, 2023 | Adversarial Robustness | Code Available | 1 |
| Object-aware Inversion and Reassembly for Image Editing | Oct 18, 2023 | Benchmarking, Denoising | Code Available | 1 |