| Ducho 2.0: Towards a More Up-to-Date Unified Framework for the Extraction of Multimodal Features in Recommendation | Mar 7, 2024 | Benchmarking, Multimodal Recommendation | Code Available | 1 |
| Dissecting Sample Hardness: A Fine-Grained Analysis of Hardness Characterization Methods for Data-Centric AI | Mar 7, 2024 | Benchmarking | Code Available | 0 |
| Three Revisits to Node-Level Graph Anomaly Detection: Outliers, Message Passing and Hyperbolic Neural Networks | Mar 6, 2024 | Anomaly Detection, Benchmarking | Code Available | 0 |
| Comparison Performance of Spectrogram and Scalogram as Input of Acoustic Recognition Task | Mar 6, 2024 | Benchmarking | Code Available | 0 |
| A Density-Guided Temporal Attention Transformer for Indiscernible Object Counting in Underwater Video | Mar 6, 2024 | Benchmarking, Crowd Counting | Unverified | 0 |
| BAIT: Benchmarking (Embedding) Architectures for Interactive Theorem-Proving | Mar 6, 2024 | Automated Theorem Proving, Benchmarking | Unverified | 0 |
| Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem | Mar 6, 2024 | Benchmarking, Hallucination | Code Available | 0 |
| InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents | Mar 5, 2024 | Benchmarking, Language Modeling | Code Available | 2 |
| Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation | Mar 5, 2024 | Benchmarking, In-Context Learning | Unverified | 0 |
| Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering | Mar 5, 2024 | Benchmarking, Code Generation | Unverified | 0 |
| Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground | Mar 4, 2024 | Benchmarking | Unverified | 0 |
| SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis | Mar 4, 2024 | Benchmarking, Drug Discovery | Code Available | 2 |
| REAL-Colon: A dataset for developing real-world AI applications in colonoscopy | Mar 4, 2024 | Benchmarking | Code Available | 2 |
| Classification of the Fashion-MNIST Dataset on a Quantum Computer | Mar 4, 2024 | Benchmarking, Quantum Machine Learning | Unverified | 0 |
| Model Lakes | Mar 4, 2024 | Benchmarking, Management | Unverified | 0 |
| Fast Benchmarking of Asynchronous Multi-Fidelity Optimization on Zero-Cost Benchmarks | Mar 4, 2024 | Benchmarking | Code Available | 0 |
| a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification | Mar 3, 2024 | Benchmarking, Speaker Verification | Code Available | 0 |
| A Bayesian Committee Machine Potential for Oxygen-containing Organic Compounds | Mar 2, 2024 | Benchmarking, Position | Unverified | 0 |
| Benchmarking Segmentation Models with Mask-Preserved Attribute Editing | Mar 2, 2024 | Attribute, Benchmarking | Code Available | 1 |
| SINDy vs Hard Nonlinearities and Hidden Dynamics: a Benchmarking Study | Mar 1, 2024 | Benchmarking | Unverified | 0 |
| Beyond Single-Model Views for Deep Learning: Optimization versus Generalizability of Stochastic Optimization Algorithms | Mar 1, 2024 | Benchmarking, Stochastic Optimization | Unverified | 0 |
| Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance | Mar 1, 2024 | Benchmarking, Stance Detection | Unverified | 0 |
| Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models | Mar 1, 2024 | Benchmarking, Mathematical Reasoning | Unverified | 0 |
| TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs | Mar 1, 2024 | Benchmarking | Code Available | 1 |
| Imitation Learning Datasets: A Toolkit For Creating Datasets, Training Agents and Benchmarking | Mar 1, 2024 | Benchmarking, Imitation Learning | Unverified | 0 |