| Official-NV: An LLM-Generated News Video Dataset for Multimodal Fake News Detection | Jul 28, 2024 | Benchmarking, Fake News Detection | Unverified | 0 |
| On the Evaluation Consistency of Attribution-based Explanations | Jul 28, 2024 | Benchmarking | Code Available | 0 |
| OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation | Jul 26, 2024 | Benchmarking, Document AI | Code Available | 1 |
| Benchmarking Dependence Measures to Prevent Shortcut Learning in Medical Imaging | Jul 26, 2024 | Benchmarking | Code Available | 0 |
| Towards a Multidimensional Evaluation Framework for Empathetic Conversational Systems | Jul 26, 2024 | Benchmarking | Unverified | 0 |
| VoxSim: A perceptual voice similarity dataset | Jul 26, 2024 | Benchmarking, Speaker Recognition | Code Available | 1 |
| AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents | Jul 26, 2024 | Benchmarking, Code Generation | Code Available | 3 |
| ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks | Jul 26, 2024 | Benchmarking, Model Selection | Code Available | 1 |
| SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images | Jul 25, 2024 | Benchmarking | Unverified | 0 |
| GermanPartiesQA: Benchmarking Commercial Large Language Models for Political Bias and Sycophancy | Jul 25, 2024 | Benchmarking | Unverified | 0 |
| Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care | Jul 25, 2024 | Benchmarking, Diagnostic | Code Available | 1 |
| AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction | Jul 25, 2024 | Benchmarking, Deep Learning | Code Available | 1 |
| Building a Domain-specific Guardrail Model in Production | Jul 24, 2024 | Benchmarking, Language Modelling | Unverified | 0 |
| Quality Assured: Rethinking Annotation Strategies in Imaging AI | Jul 24, 2024 | Benchmarking | Unverified | 0 |
| HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation | Jul 24, 2024 | Benchmarking, Human Animation | Code Available | 3 |
| MOMAland: A Set of Benchmarks for Multi-Objective Multi-Agent Reinforcement Learning | Jul 23, 2024 | Benchmarking, Decision Making | Code Available | 2 |
| COALA: A Practical and Vision-Centric Federated Learning Platform | Jul 23, 2024 | Benchmarking, Continual Learning | Code Available | 2 |
| Flexible Generation of Preference Data for Recommendation Analysis | Jul 23, 2024 | Benchmarking, Recommendation Systems | Code Available | 0 |
| Can time series forecasting be automated? A benchmark and analysis | Jul 23, 2024 | Benchmarking, Decision Making | Unverified | 0 |
| Hi-EF: Benchmarking Emotion Forecasting in Human-interaction | Jul 23, 2024 | Benchmarking | Code Available | 0 |
| BONES: a Benchmark fOr Neural Estimation of Shapley values | Jul 23, 2024 | Benchmarking | Code Available | 0 |
| AbdomenAtlas: A Large-Scale, Detailed-Annotated, & Multi-Center Dataset for Efficient Transfer Learning and Open Algorithmic Benchmarking | Jul 23, 2024 | Benchmarking, Transfer Learning | Code Available | 3 |
| Aggregated Attributions for Explanatory Analysis of 3D Segmentation Models | Jul 23, 2024 | Benchmarking, Segmentation | Code Available | 0 |
| InLUT3D: Challenging real indoor dataset for point cloud analysis | Jul 22, 2024 | Benchmarking, Scene Understanding | Unverified | 0 |
| Unlocking the Potential: Benchmarking Large Language Models in Water Engineering and Research | Jul 22, 2024 | Benchmarking | Unverified | 0 |
| Benchmarks as Microscopes: A Call for Model Metrology | Jul 22, 2024 | Benchmarking, model | Unverified | 0 |
| Cascaded two-stage feature clustering and selection via separability and consistency in fuzzy decision systems | Jul 22, 2024 | Benchmarking, Clustering | Unverified | 0 |
| LCA-on-the-Line: Benchmarking Out-of-Distribution Generalization with Class Taxonomies | Jul 22, 2024 | Benchmarking, Out-of-Distribution Generalization | Code Available | 1 |
| HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning | Jul 22, 2024 | Benchmarking, Hallucination | Code Available | 1 |
| Open-CD: A Comprehensive Toolbox for Change Detection | Jul 22, 2024 | Benchmarking, Change Detection | Unverified | 0 |
| StylusAI: Stylistic Adaptation for Robust German Handwritten Text Generation | Jul 22, 2024 | Benchmarking, Text Generation | Unverified | 0 |
| Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA | Jul 22, 2024 | Benchmarking, Contrastive Learning | Code Available | 0 |
| Non-Reference Quality Assessment for Medical Imaging: Application to Synthetic Brain MRIs | Jul 20, 2024 | Benchmarking, Domain Adaptation | Unverified | 0 |
| POGEMA: A Benchmark Platform for Cooperative Multi-Agent Pathfinding | Jul 20, 2024 | Benchmarking, Heuristic Search | Code Available | 1 |
| Benchmarking deep learning models for bearing fault diagnosis using the CWRU dataset: A multi-label approach | Jul 19, 2024 | Benchmarking, Binary Classification | Unverified | 0 |
| OCTrack: Benchmarking the Open-Corpus Multi-Object Tracking | Jul 19, 2024 | Benchmarking, Multi-Object Tracking | Unverified | 0 |
| Realistic Evaluation of Test-Time Adaptation Algorithms: Unsupervised Hyperparameter Selection | Jul 19, 2024 | Benchmarking, Model Selection | Unverified | 0 |
| Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations | Jul 19, 2024 | Benchmarking, Fairness | Code Available | 1 |
| ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? | Jul 19, 2024 | Benchmarking, Code Generation | Code Available | 7 |
| Vision-Based Power Line Cables and Pylons Detection for Low Flying Aircraft | Jul 19, 2024 | Benchmarking, Transfer Learning | Unverified | 0 |
| SHS: Scorpion Hunting Strategy Swarm Algorithm | Jul 19, 2024 | Benchmarking | Unverified | 0 |
| Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance | Jul 18, 2024 | Benchmarking | Unverified | 0 |
| RT-Pose: A 4D Radar Tensor-based 3D Human Pose Estimation and Localization Benchmark | Jul 18, 2024 | 3D Human Pose Estimation, Benchmarking | Unverified | 0 |
| Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Jul 18, 2024 | Benchmarking, Language Modeling | Unverified | 0 |
| Restore Anything Model via Efficient Degradation Adaptation | Jul 18, 2024 | 5-Degradation Blind All-in-One Image Restoration, Benchmarking | Code Available | 1 |
| Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source Framework Applied on Rett Syndrome and Alzheimer's Disease | Jul 18, 2024 | Benchmarking | Code Available | 0 |
| Comprehensive Review and Empirical Evaluation of Causal Discovery Algorithms for Numerical Data | Jul 17, 2024 | Articles, Benchmarking | Unverified | 0 |
| Temporal receptive field in dynamic graph learning: A comprehensive analysis | Jul 17, 2024 | Benchmarking, Dynamic Link Prediction | Code Available | 0 |
| Abstraction Alignment: Comparing Model-Learned and Human-Encoded Conceptual Relationships | Jul 17, 2024 | Benchmarking | Code Available | 0 |
| Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models? | Jul 17, 2024 | Benchmarking, Sarcasm Detection | Unverified | 0 |