| Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking | Jun 6, 2024 | 6D Pose Estimation using RGB, Benchmarking | Unverified | 0 |
| Time Sensitive Knowledge Editing through Efficient Finetuning | Jun 6, 2024 | Benchmarking, knowledge editing | Unverified | 0 |
| Statistical Multicriteria Benchmarking via the GSD-Front | Jun 6, 2024 | Benchmarking | Unverified | 0 |
| A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection | Jun 5, 2024 | Anomaly Detection, Benchmarking | Unverified | 0 |
| Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation | Jun 5, 2024 | Benchmarking, Image Segmentation | Unverified | 0 |
| Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs | Jun 4, 2024 | Benchmarking, Fairness | Unverified | 0 |
| Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check | Jun 4, 2024 | Benchmarking, Representation Learning | Unverified | 0 |
| Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance | Jun 4, 2024 | Benchmarking, Drug Discovery | Code Available | 0 |
| Analyzing the Feature Extractor Networks for Face Image Synthesis | Jun 4, 2024 | Benchmarking, Image Generation | Code Available | 0 |
| MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset | Jun 4, 2024 | Benchmarking | Code Available | 0 |
| ACCORD: Closing the Commonsense Measurability Gap | Jun 4, 2024 | Benchmarking, Common Sense Reasoning | Code Available | 0 |
| TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability | Jun 4, 2024 | Benchmarking, Language Modeling | Code Available | 0 |
| LanEvil: Benchmarking the Robustness of Lane Detection to Environmental Illusions | Jun 3, 2024 | Autonomous Driving, Benchmarking | Unverified | 0 |
| ELSA: Evaluating Localization of Social Activities in Urban Streets using Open-Vocabulary Detection | Jun 3, 2024 | Action Recognition, Benchmarking | Unverified | 0 |
| R2C2-Coder: Enhancing and Benchmarking Real-world Repository-level Code Completion Abilities of Code Large Language Models | Jun 3, 2024 | Benchmarking, Code Completion | Unverified | 0 |
| Scaffold Splits Overestimate Virtual Screening Performance | Jun 2, 2024 | Benchmarking, Clustering | Unverified | 0 |
| WebSuite: Systematically Evaluating Why Web Agents Fail | Jun 1, 2024 | Benchmarking, Diagnostic | Code Available | 0 |
| On the project risk baseline: integrating aleatory uncertainty into project scheduling | May 31, 2024 | Benchmarking, Scheduling | Unverified | 0 |
| Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images | May 30, 2024 | All, Benchmarking | Unverified | 0 |
| CoSy: Evaluating Textual Explanations of Neurons | May 30, 2024 | Benchmarking | Unverified | 0 |
| MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification | May 29, 2024 | Benchmarking | Unverified | 0 |
| Categorization of 33 computational methods to detect spatially variable genes from spatially resolved transcriptomics data | May 29, 2024 | Benchmarking, Specificity | Unverified | 0 |
| Exploring Thermography Technology: A Comprehensive Facial Dataset for Face Detection, Recognition, and Emotion | May 28, 2024 | Benchmarking, Emotion Recognition | Unverified | 0 |
| Risk-Neutral Generative Networks | May 28, 2024 | Benchmarking | Unverified | 0 |
| A Correlation- and Mean-Aware Loss Function and Benchmarking Framework to Improve GAN-based Tabular Data Synthesis | May 27, 2024 | Benchmarking | Unverified | 0 |
| Benchmarking General-Purpose In-Context Learning | May 27, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases | May 25, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| BOLD: Boolean Logic Deep Learning | May 25, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| NuwaTS: a Foundation Model Mending Every Incomplete Time Series | May 24, 2024 | Benchmarking, Contrastive Learning | Unverified | 0 |
| MCDFN: Supply Chain Demand Forecasting via an Explainable Multi-Channel Data Fusion Network Model | May 24, 2024 | Benchmarking, Demand Forecasting | Unverified | 0 |
| Application based Evaluation of an Efficient Spike-Encoder, "Spiketrum" | May 24, 2024 | Benchmarking, Classification | Unverified | 0 |
| Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks | May 24, 2024 | Benchmarking, Decoder | Unverified | 0 |
| Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study | May 24, 2024 | Benchmarking, Vulnerability Detection | Unverified | 0 |
| Full-stack evaluation of Machine Learning inference workloads for RISC-V systems | May 24, 2024 | Benchmarking, Deep Learning | Unverified | 0 |
| Benchmarking Hierarchical Image Pyramid Transformer for the classification of colon biopsies and polyps in histopathology images | May 24, 2024 | Benchmarking, Classification | Unverified | 0 |
| Free Performance Gain from Mixing Multiple Partially Labeled Samples in Multi-label Image Classification | May 24, 2024 | Benchmarking, Data Augmentation | Unverified | 0 |
| A Gap in Time: The Challenge of Processing Heterogeneous IoT Data in Digitalized Buildings | May 23, 2024 | Benchmarking, Data Integration | Unverified | 0 |
| An Empirical Study of Training State-of-the-Art LiDAR Segmentation Models | May 23, 2024 | Autonomous Driving, Benchmarking | Unverified | 0 |
| CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models | May 22, 2024 | Benchmarking, Hallucination | Unverified | 0 |
| EXACT: Towards a platform for empirically benchmarking Machine Learning model explanation methods | May 20, 2024 | Benchmarking, Explainable artificial intelligence | Unverified | 0 |
| CT-Eval: Benchmarking Chinese Text-to-Table Performance in Large Language Models | May 20, 2024 | Benchmarking, Diversity | Unverified | 0 |
| DispaRisk: Auditing Fairness Through Usable Information | May 20, 2024 | Benchmarking, Bias Detection | Code Available | 0 |
| EnviroExam: Benchmarking Environmental Science Knowledge of Large Language Models | May 18, 2024 | Benchmarking, Specificity | Unverified | 0 |
| From Generalist to Specialist: Improving Large Language Models for Medical Physics Using ARCoT | May 17, 2024 | Benchmarking, Multiple-choice | Unverified | 0 |
| SMP Challenge: An Overview and Analysis of Social Media Prediction Challenge | May 17, 2024 | Benchmarking, Social Media Popularity Prediction | Unverified | 0 |
| BraTS-Path Challenge: Assessing Heterogeneous Histopathologic Brain Tumor Sub-regions | May 17, 2024 | Benchmarking, Prognosis | Unverified | 0 |
| An Integrated Framework for Multi-Granular Explanation of Video Summarization | May 16, 2024 | Benchmarking, Panoptic Segmentation | Code Available | 0 |
| Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail Promotions | May 16, 2024 | Benchmarking, Reinforcement Learning (RL) | Code Available | 0 |
| A Robust Autoencoder Ensemble-Based Approach for Anomaly Detection in Text | May 16, 2024 | Anomaly Detection, Benchmarking | Unverified | 0 |
| SpeechVerse: A Large-scale Generalizable Audio Language Model | May 14, 2024 | Automatic Speech Recognition, Benchmarking | Unverified | 0 |